From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems

Jianliang He (Fudan University; hejl20@fudan.edu.cn) · Siyu Chen (Yale University; siyu.chen.sc3226@yale.edu) · Fengzhuo Zhang (National University of Singapore; fzzhang@u.nus.edu) · Zhuoran Yang (Yale University; zhuoran.yang@yale.edu). Jianliang He and Siyu Chen contributed equally.
Abstract

In this work, from a theoretical lens, we aim to understand why large language model (LLM) empowered agents are able to solve decision-making problems in the physical world. To this end, we consider a hierarchical reinforcement learning (RL) model in which the LLM Planner and the Actor perform high-level task planning and low-level execution, respectively. Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting. Under proper assumptions on the pretraining data, we prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning. We also highlight the necessity of exploration beyond the subgoals derived from BAIL by proving that naively executing the subgoals returned by the LLM leads to linear regret. As a remedy, we introduce an $\epsilon$-greedy exploration strategy to BAIL, which is proven to incur sublinear regret when the pretraining error is small. Finally, we extend our theoretical framework to include scenarios where the LLM Planner serves as a world model for inferring the transition model of the environment, and to multi-agent settings, enabling coordination among multiple Actors.

1 Introduction

The advent of large language models (LLMs) such as GPT-4 (OpenAI, 2023) and Llama 2 (Touvron et al., 2023) has marked a significant leap in artificial intelligence, thanks to their striking capabilities in understanding language and performing complex reasoning tasks. These capabilities have led to the emergence of LLM-empowered agents (LLM Agents), where LLMs are used in conjunction with tools or actuators to solve decision-making problems in the physical world. LLM Agents have showcased promising empirical successes in a wide range of applications, including autonomous driving (Wang et al., 2023b; Fu et al., 2024), robotics (Brohan et al., 2023; Li et al., 2023a), and personal assistance (Liu et al., 2023; Nottingham et al., 2023). This progress signifies a crucial advancement in the creation of intelligent decision-making systems, distinguished by a high degree of autonomy and seamless human-AI collaboration.

LLMs only take natural language as input. To bridge the language and physical domains, LLM Agents typically incorporate three critical components: an LLM Planner, a physical Actor, and a multimodal Reporter, which function respectively as the brain, hands, and eyes of the LLM Agent. Specifically, upon receiving a task described by a human user, the LLM Planner breaks down the overall task into a series of subgoals. Subsequently, the Actor implements each subgoal in the physical world through a sequence of actions. Meanwhile, the Reporter monitors changes in the physical world and conveys this information back to the LLM Planner in natural language. This dynamic interaction among the Planner, Actor, and Reporter empowers LLM Agents to understand the environment, formulate informed decisions, and execute actions effectively, thus seamlessly integrating high-level linguistic subgoals with low-level physical task execution.

The revolutionary approach of LLM Agents represents a paradigm shift away from traditional learning-based decision-making systems. Unlike these conventional systems, LLM Agents are not tailored to any specific task. Instead, they rely on the synergy of their three distinct components, each trained separately and often for different objectives. In particular, the LLM Planner is trained to predict the next token in a sequence on vast document data, and when deployed to solve a task, it is accessed solely via prompting, with the LLM parameters fixed. The Actor, a language-conditioned policy, can be trained by RL or imitation learning. The Reporter, a multimodal model, is trained to translate physical states (e.g., images) into natural language. This unique configuration prompts critical research questions regarding the theoretical underpinnings of LLM Agents, particularly concerning their decision-making effectiveness.

Figure 1: Overview of the Planner-Actor-Reporter (PAR) system as an LLM Agent. Acting as a central controller, the Planner conducts high-level planning by storing the history and reasoning through the iterative use of the ICL ability of LLMs, coupled with exploration. The Actor handles low-level planning and executes subgoals using pre-programmed skill sets, and the Reporter perceives and processes multimodal information from the environment to reinforce the ongoing planning.

In this work, we make an initial step toward developing a theoretical framework for understanding the dynamics and effectiveness of LLM Agents. Specifically, we aim to answer the following questions: (a) What is a theoretical model for understanding the performance of LLM Agents? (b) How do pretrained LLMs solve decision-making problems in the physical world via prompting? (c) How does an LLM Agent address the exploration-exploitation tradeoff? (d) How do the statistical errors of the pretrained LLM and Reporter affect the overall performance of the LLM Agent?

To address Question (a), we propose analyzing LLM Agents within a hierarchical reinforcement learning framework (Barto and Mahadevan, 2003; Pateria et al., 2021), positioning the LLM Planner and the Actor as policies operating within high-level POMDPs and low-level MDPs, respectively (§ 3.1). Both levels share the same state space, namely the physical state, though the LLM Planner does not directly observe this state but instead receives a language-based description from the Reporter, effectively navigating a POMDP. The action space of the high-level POMDP is the set of language subgoals. Meanwhile, the state transition kernel is determined by the pretrained Actor, and is thus associated with a variable $z$ that summarizes its dependency on the low-level Actor; this variable is unknown to the LLM Planner. After pretraining, without prior knowledge of the Actor's quality or the physical environment, the LLM Planner attempts to solve the high-level POMDP by iteratively generating a sequence of subgoals via prompting, based on feedback from the Reporter. Under this framework, the overall performance of the LLM Agent is captured by the regret in terms of finding the optimal policy of the hierarchical RL problem in the online setting (§ 3.2).

Furthermore, to answer Question (b), we prove that when the pretraining data includes a mixture of expert trajectories, the pretrained LLM Planner essentially performs Bayesian aggregated imitation learning (BAIL) through in-context learning during the prompting stage (Theorem 4.2). This process involves constructing a posterior distribution over the hidden parameter $z$ of the transition kernel, followed by generating subgoals that emulate a randomly selected expert policy, weighted according to this posterior distribution. Such a Bayesian learning mechanism is encoded by the LLM architecture and is achieved through prompting.

However, since the LLM has no prior knowledge of the physical environment, it needs to guide the Actor to explore it. We prove that merely adhering to BAIL-derived subgoals can lead to inadequate exploration, resulting in linear regret (Proposition 4.3). To mitigate this, i.e., to address Question (c), we introduce an $\epsilon$-greedy exploration strategy, which occasionally deviates from BAIL subgoals in favor of exploration, significantly enhancing learning efficacy by ensuring sublinear regret (Theorem 4.6). Moreover, to address Question (d), we establish that the regret is bounded by a sum of two terms (Theorem 5.7): a $\sqrt{T}$ term in the number of episodes $T$ for which the LLM Agent is deployed to the hierarchical RL problem, and an additional term representing the statistical error from pretraining the LLM Planner and the Reporter via maximum likelihood estimation (MLE) and contrastive learning, respectively (Theorems 5.2 and 5.5).
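As a toy illustration of the $\epsilon$-greedy remedy, the sketch below mixes a BAIL-style recommendation with uniform exploration over the subgoal space. The subgoal names and the stand-in `bail` policy are hypothetical, not the paper's algorithm; only the mixing rule reflects the text.

```python
import random

def epsilon_greedy_subgoal(bail_policy, history, subgoal_space, epsilon):
    """With probability epsilon pick a uniformly random subgoal
    (exploration); otherwise follow the BAIL recommendation."""
    if random.random() < epsilon:
        return random.choice(subgoal_space)
    return bail_policy(history)

subgoals = ["grasp", "lift", "move", "release"]
bail = lambda hist: "grasp"   # toy stand-in for the LLM's BAIL output

random.seed(0)
picks = [epsilon_greedy_subgoal(bail, [], subgoals, epsilon=0.25)
         for _ in range(2000)]
frac_grasp = picks.count("grasp") / len(picks)
# Expected fraction of "grasp" is about 0.75 + 0.25/4 = 0.8125.
```

Setting $\epsilon = 0$ recovers pure BAIL, which the analysis shows can incur linear regret; any fixed $\epsilon > 0$ guarantees every subgoal is tried infinitely often.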

Finally, we extend our analysis to scenarios where the Planner utilizes the LLM as a world model to infer the upper-level POMDP's transition model via Bayesian model aggregation (Proposition B.1, Corollary B.3). Our theoretical framework also accommodates a multi-agent setting, where the LLM Planner coordinates a collaborative team of low-level Actors (Corollary B.4).

2 Preliminaries and Related Works

Large Language Models.

Large Language Models (LLMs) such as ChatGPT (Brown et al., 2020), GPT-4 (OpenAI, 2023), Llama (Touvron et al., 2023), and Gemini (Team et al., 2023) are pretrained on vast text corpora to predict text in an autoregressive manner. Starting from an initial token $\ell_1 \in \mathfrak{L} \subseteq \mathbb{R}^d$, where $d$ denotes the dimension of the token vector and $\mathfrak{L}$ denotes the language space, the LLM, with parameters $\theta \in \Theta$, predicts the next token via $\ell_{t+1} \sim \mathtt{LLM}_\theta(\cdot \mid S_t)$, where $S_t = (\ell_1, \dots, \ell_t)$ and $t \in \mathbb{N}$. Each token $\ell_t \in \mathfrak{L}$ specifies a word or a word's position, and the token sequence $S_t$ resides in the space of token sequences $\mathfrak{L}^*$. This autoregressive generating process terminates when the stop token is generated.
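The generation loop above can be sketched as follows. The `toy_llm` distribution and the `<eos>` stop token are illustrative assumptions, not a real language model; the loop structure mirrors $\ell_{t+1} \sim \mathtt{LLM}_\theta(\cdot \mid S_t)$ with termination on the stop token.

```python
import random

STOP = "<eos>"

def sample_token(dist):
    """Draw one token from a {token: probability} distribution."""
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok  # guard against floating-point round-off

def generate(llm, prefix, max_len=50):
    """Autoregressively extend S_t = (l_1, ..., l_t) via
    l_{t+1} ~ llm(. | S_t) until the stop token is emitted."""
    seq = list(prefix)
    while len(seq) < max_len:
        nxt = sample_token(llm(tuple(seq)))
        seq.append(nxt)
        if nxt == STOP:
            break
    return seq

# Toy "LLM": emits 'a'/'b' until the context holds 3 tokens, then stops.
toy_llm = lambda s: {STOP: 1.0} if len(s) >= 3 else {"a": 0.5, "b": 0.5}

random.seed(1)
out = generate(toy_llm, ["<bos>"])
```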

In-Context Learning.

LLMs have exhibited robust reasoning capabilities, and a crucial aspect of their reasoning prowess is the in-context learning (ICL) ability. This ability is further enhanced through additional training stages (Iyer et al., 2022), careful selection and arrangement of informative demonstrations (Liu et al., 2021; Kim et al., 2022), explicit instruction (Honovich et al., 2022), and prompts that stimulate chains of thought (Wei et al., 2022b). Unlike fine-tuned models customized for specific tasks, LLMs showcase comparable capabilities by learning from informative prompts (Li et al., 2022; Liu et al., 2022b). Assume that the prompt, denoted by $\mathtt{pt}_t = (\ell_1, \dots, \ell_t) \in \mathfrak{L}^*$, is generated autoregressively based on a latent variable $z \in \mathcal{Z}$: each token follows the generating distribution $\ell_t \sim \mathbb{P}(\cdot \mid \mathtt{pt}_{t-1}, z)$ with $\mathtt{pt}_t = (\mathtt{pt}_{t-1}, \ell_t)$, where $\mathcal{Z}$ denotes the space of hidden information or concepts. Such latent structure is commonly employed in language models, including topic models like LDA (Blei et al., 2003), BERT (Devlin et al., 2018), and generative models like VAE (Kusner et al., 2017) and T5 (Raffel et al., 2020), and is also widely adopted in the theoretical analysis of ICL (Xie et al., 2021; Zhang et al., 2023). Theoretical understanding of ICL is an active area of research. Since real-world datasets used for LLM pretraining are difficult to model theoretically and are very large, ICL has also been studied in stylized setups (Xie et al., 2021; Garg et al., 2022; Chan et al., 2022; Hahn and Goyal, 2023; Zhang et al., 2023).
In this paper, we build upon the framework attributing the ICL capability to Bayesian inference (Xie et al., 2021; Jiang, 2023; Zhang et al., 2023), which posits that a pretrained LLM predicts the next token by aggregating the generating distribution over the latent variable $z \in \mathcal{Z}$ with respect to its posterior distribution. Moreover, a series of practical experiments, including Wang et al. (2023a) and Ahuja et al. (2023), provide empirical support for this Bayesian view.
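A minimal numerical sketch of this Bayesian-inference view: the next-token probability is the generating distribution averaged over the posterior of the latent concept $z$, i.e., $\mathbb{P}(\ell \mid \mathtt{pt}) = \sum_z \mathbb{P}(\ell \mid \mathtt{pt}, z)\, \mathbb{P}(z \mid \mathtt{pt})$. The two-concept prior, likelihood, and emission model below are invented for illustration.

```python
def posterior(prior, likelihood, prompt):
    """P(z | prompt) proportional to P(z) * P(prompt | z) (Bayes rule)."""
    w = {z: p * likelihood(prompt, z) for z, p in prior.items()}
    total = sum(w.values())
    return {z: v / total for z, v in w.items()}

def aggregate_next_token(prior, likelihood, cond, prompt, token):
    """P(token | prompt) = sum_z P(token | prompt, z) * P(z | prompt)."""
    post = posterior(prior, likelihood, prompt)
    return sum(cond(token, prompt, z) * post[z] for z in post)

# Two latent concepts z in {0, 1}; each emits its own digit w.p. 0.9.
prior = {0: 0.5, 1: 0.5}
# Toy likelihood of an all-"1"s prompt: much larger under z = 1.
likelihood = lambda prompt, z: (0.9 if z == 1 else 0.1) ** len(prompt)
cond = lambda tok, prompt, z: 0.9 if tok == str(z) else 0.1

p_one = aggregate_next_token(prior, likelihood, cond, "111", "1")
# The posterior concentrates on z = 1, so p_one is close to 0.9.
```

As the prompt grows, the posterior concentrates on the true concept and the aggregated prediction approaches the concept-conditional distribution, which is the mechanism exploited in the BAIL analysis.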

LLM Agents.

LLMs, as highlighted in OpenAI (2023), are powerful tools for task planning (Wei et al., 2022a; Hu and Shu, 2023). The success of LLM Agents marks a shift from task-specific policies to a pretrain-finetune-prompt paradigm. By breaking complex tasks down into subgoals, an LLM Agent facilitates effective zero-shot resource allocation across environments. For instance, envision a scenario where a robotic arm is tasked with “move a teapot from the stove to a shelf”, a task for which the robotic arm may not be pretrained. Leveraging LLMs, however, allows the decomposition of the task into a sequence of executable subgoals: “grasp the teapot”, “lift the teapot”, “move the teapot to the shelf”, and “release the teapot”.

Conventional task-planning and decision-making problems have commonly been addressed by symbolic planners that transform them into search problems (Bonet and Geffner, 2001; Ghallab et al., 2004), or by designing distinct reinforcement learning or control policies for each specific scenario. Recent empirical studies have shifted towards leveraging LLMs as symbolic planners in various domains, including robotic control (Mandi et al., 2023; Brohan et al., 2023; Li et al., 2023a; Du et al., 2023), autonomous driving (Wang et al., 2023b; Fu et al., 2024), and personal decision assistance (Li et al., 2022; Lin et al., 2023a; Hu et al., 2023; Liu et al., 2023; Nottingham et al., 2023). Another recent line of research has been dedicated to devising diverse prompting schemes to enhance the reasoning capability of LLMs (Wei et al., 2022b; Yao et al., 2023a; Yao et al., 2023b; Hao et al., 2023). Despite the considerable empirical success, a comprehensive theoretical analysis of LLM Agents is lacking. In this paper, we formalize this approach into a hierarchical LLM-empowered planning framework and provide a theoretical analysis of its performance. Two recent works, Liu et al. (2023) and Lee et al. (2023), also aim to establish provable algorithms for planning with LLMs or decision-pretrained Transformers (DPT). In comparison, we discuss the plausibility of using LLMs both as a subgoal generator (Lee et al., 2023) and as a simulated world model (Liu et al., 2023). Furthermore, we provide a statistical guarantee for the pretrained models and conduct a detailed examination of the algorithm's performance in practical settings, bringing our analysis closer to real-world applications.

3 Theoretical Framework for LLM Agents

To formalize the architecture of LLM Agents, we propose a general theoretical framework, the Planner-Actor-Reporter (PAR) system, and model the problem as a hierarchical RL problem (Pateria et al., 2021). Specifically, the Planner, empowered by LLMs, conducts high-level task planning within the language space; the Actor, pretrained before deployment, undertakes low-level motion planning within the physical world; and the Reporter, equipped with a sensor to sense the physical environment, processes the information and feeds it back to the Planner, bridging the gap between the language space and the physical world (see § 3.1). Additionally, we present the performance metric and the pretraining methods for the Planner's LLM and the Reporter's translator in § 3.2.

3.1 Planner-Actor-Reporter System

In this section, we delve into the details of the PAR system under a hierarchical Markov decision process (HMDP). At the high level, the Planner, empowered by the LLM, handles task planning by decomposing tasks into subgoals, thereby solving a language-conditioned partially observable Markov decision process (POMDP) with finite horizon $H$. At the low level, the Actor translates these subgoals into actionable steps in the physical world, handling a language-conditioned Markov decision process (MDP) with finite horizon $H_a$. (Throughout the paper, we use the notation $\bar{\cdot}$ to distinguish low-level elements from their high-level counterparts.) Please refer to the right panel of Figure 1 for a detailed example of an LLM Agent, and see Figure 2 for an overview of the hierarchical interactive process.

Low-level MDP.

Let $\mathcal{G} \subseteq \mathfrak{L}$ be the space of language subgoals, and let $\mathcal{S}$ and $\mathcal{A}$ denote the spaces of physical states and actions, respectively. At high-level step $h$, the low-level MDP is specified by a transition kernel $\mathbb{T}_h = \{\mathbb{T}_{h,\bar{h}}\}_{\bar{h} \in [H_a]}$ and rewards that depend on the subgoal $g \in \mathcal{G}$. The Actor is modeled as a language-conditioned policy $\mu = \{\mu_g\}_{g \in \mathcal{G}}$, where $\mu_g = \{\mu_{\bar{h}}(\cdot \mid \cdot, g)\}_{\bar{h} \in [H_a]}$ and $\mu_{\bar{h}}: \mathcal{S} \times \mathcal{G} \mapsto \Delta(\mathcal{A})$. Assume that the Actor stops at step $H_a + 1$, regardless of subgoal achievement. Subsequently, the Planner receives the observation of the current state $\bar{s}_{h, H_a+1}$ from the Reporter and sends a new subgoal to the Actor based on the historical feedback.

High-level POMDP.

Suppose that a low-level episode corresponds to a single high-level action of the Planner. The high-level POMDP thus reuses the physical state space $\mathcal{S}$ as its state space, but takes the subgoal space $\mathcal{G}$ as its action space. The high-level transition kernel is jointly determined by the low-level policy $\mu$ and the physical transition kernel $\mathbb{T}$ such that

$$\mathbb{P}_{z,h}(s' \mid s, g) = \mathbb{P}\big(\bar{s}_{h, H_a+1} = s' \;\big|\; \bar{s}_{h,1} = s,\; a_{h, 1:\bar{h}} \sim \mu_g,\; \bar{s}_{h, 2:\bar{h}+1} \sim \mathbb{T}_h\big), \qquad (3.1)$$

where we write $z = (\mathbb{T}, \mu)$. Since the LLM-empowered Planner cannot directly process physical states, it relies on (partial) observations generated by the Reporter. Specifically, let $o_h \in \mathcal{O}$ describe the physical state $s_h \in \mathcal{S}$ in language through a translation distribution $\mathbb{O}: \mathcal{S} \mapsto \Delta(\mathcal{O})$, where $\mathcal{O} \subseteq \mathfrak{L}$ denotes the space of observations. At each step $h \in [H]$, a reward $r_h(o_h, \omega) \in [0,1]$ is obtained, which depends on both the observation and the task $\omega \in \Omega$ assigned by human users. Here, $\Omega \subseteq \mathfrak{L}$ denotes the space of potential tasks in language.

Interactive Protocol.

The Planner aims to determine a sequence of subgoals $\{g_h\}_{h \in [H]}$, generated by a policy $\pi = \{\pi_h\}_{h \in [H]}$, that maximizes the expected sum of rewards. During task planning, the Planner must infer both the Actor's intention, i.e., the policy $\mu$, and the environment, i.e., the physical transition kernel $\mathbb{T}$, from the historical information. Thus, $z$ constitutes all the latent information for the high-level Planner; we denote by $\mathcal{Z}$ the space of all potential latent variables, with $|\mathcal{Z}| < \infty$.

Figure 2: Illustration of the structure of the HMDP. The low-level MDP is characterized by the transition kernel $\mathbb{T}$, which captures the dynamics of the physical environment. Each high-level transition results from a sequence of low-level actions in the physical environment, guided by the policies $\mu = \{\mu_g\}_{g \in \mathcal{G}}$. Thus, the high-level POMDP incorporates latent information $z = (\mathbb{T}, \mu)$ originating from the low level.

To summarize, the interactive protocol is as follows: at the beginning of each episode $t$, the Planner receives a task $\omega^t$. At step $h$, each module proceeds as below:

Module 1: Planner.

After collecting $o_h^t$ from the Reporter, the Planner leverages the LLM for recommendations on task decomposition; the recommendation policy is denoted by $\pi^t_{h,\mathtt{LLM}}: \mathcal{T}^* \times (\mathcal{O} \times \mathcal{G})^{h-1} \times \mathcal{O} \times \Omega \mapsto \Delta(\mathcal{G})$, where $\mathcal{T}^*$ represents the space of trajectory sequences of arbitrary length. The LLM's recommendation is obtained by invoking its ICL ability with the history-dependent prompt:

$$\mathtt{pt}_h^t = \mathcal{H}_t \cup \{\omega^t, \tau_h^t\}, \qquad \mathcal{H}_t = \bigcup_{i=1}^{t-1} \{\omega^i, \tau_H^i\}, \qquad (3.2)$$

where $\mathcal{H}_t \in \mathcal{T}^*$ denotes the historical context and $\tau_h^t = \{o_1^t, g_1^t, \dots, o_h^t\}$ is the trajectory up to the $h$-th step. In the PAR system, the Planner retains autonomy and is not obligated to follow the LLM's recommendations. Let $\pi^t_h$ be the Planner's policy, which partially leverages the LLM's recommendation $\pi^t_{h,\mathtt{LLM}}(\cdot \mid \tau_h^t, \omega^t) = \mathtt{LLM}_\theta(\cdot \mid \mathtt{pt}_h^t)$. The Planner selects $g_h^t \sim \pi_h^t(\cdot \mid \tau_h^t, \omega^t)$ and sends it to the Actor.
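The prompt assembly in (3.2) can be sketched as follows, assuming tokens are plain strings; the task and observation names are hypothetical.

```python
def build_prompt(history, omega_t, tau_h):
    """Assemble pt_h^t = H_t plus {omega^t, tau_h^t} as a flat token
    list: earlier episodes' (task, full trajectory) pairs, then the
    current task and partial trajectory (o_1, g_1, ..., o_h)."""
    prompt = []
    for omega_i, tau_i in history:   # H_t: episodes i < t
        prompt.append(omega_i)
        prompt.extend(tau_i)
    prompt.append(omega_t)           # current task omega^t
    prompt.extend(tau_h)             # current partial trajectory tau_h^t
    return prompt

history = [("make tea", ["o1", "grasp kettle", "o2"])]
pt = build_prompt(history, "clean table", ["o1'"])
# pt == ["make tea", "o1", "grasp kettle", "o2", "clean table", "o1'"]
```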

Module 2: Actor.

Upon receiving $g_h^t$ from the Planner, the Actor plans to implement $g_h^t$ in the physical world with its pretrained skill sets, denoted by a subgoal-conditioned policy $\mu = \{\mu_g\}_{g \in \mathcal{G}}$. A sequence of actions $\{a_{h,\bar{h}}\}_{\bar{h} \in [H_a]}$ is executed, where the dynamics follow $a_{h,\bar{h}} \sim \mu_{\bar{h}}(\cdot \mid \bar{s}_{h,\bar{h}}, g_h^t)$ and $\bar{s}_{h,\bar{h}+1} \sim \mathbb{T}_{h,\bar{h}}(\cdot \mid \bar{s}_{h,\bar{h}}, a_{h,\bar{h}})$, starting from $\bar{s}_{h,1} = s_h^t$. The low-level episode concludes at $s_{h+1}^t = \bar{s}_{h, H_a+1}$.
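The Actor's rollout, which realizes one high-level transition of the form (3.1), can be sketched as below. The integer-line state space and the deterministic `mu` and `T` are toy assumptions standing in for the subgoal-conditioned policy and the physical transition kernel.

```python
def low_level_episode(s, g, mu, T, H_a):
    """Roll out the subgoal-conditioned policy for H_a steps:
    a ~ mu(. | s, g), then s' ~ T(. | s, a). Returns the terminal
    state s_bar_{h, H_a + 1}, i.e., one high-level transition."""
    for _ in range(H_a):
        a = mu(s, g)   # low-level action from the pretrained skill set
        s = T(s, a)    # physical dynamics
    return s

# Deterministic toy: states are integers on a line; the subgoal fixes
# the direction of motion.
mu = lambda s, g: +1 if g == "right" else -1
T = lambda s, a: s + a

s_next = low_level_episode(0, "right", mu, T, H_a=4)
# s_next == 4: four unit steps to the right.
```

With stochastic `mu` and `T`, repeating this rollout many times yields samples from the induced high-level kernel $\mathbb{P}_{z,h}(\cdot \mid s, g)$.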

Module 3: Reporter.

After the low-level episode concludes, the Reporter describes the new state $s_{h+1}^t$ via an observation $o_{h+1}^t$ generated from $\mathbb{O}_\gamma(\cdot \mid s_{h+1}^t)$, where $\mathbb{O}_\gamma: \mathcal{S} \mapsto \Delta(\mathcal{O})$ denotes the distribution of the pretrained translator. The observation $o_{h+1}^t$ of the current state is then sent back to the Planner, reinforcing the ongoing task planning.

The strength of the PAR system lies in its resemblance to RL (Sutton and Barto, 2018), allowing the Planner to iteratively adjust its planning strategy based on feedback from the Reporter. Moreover, the Reporter empowers the system to process real-time information and to integrate multiple modalities of raw data, such as RGB images, LiDAR, audio, and text (Li et al., 2023b; Xu et al., 2023). The Actor's skill sets can effectively be pretrained using goal-conditioned RL (Chane-Sane et al., 2021; Liu et al., 2022a), language-to-environment grounding (Brohan et al., 2023; Huang et al., 2022), or manual pre-programming (Singh et al., 2023).

3.2 Performance Metric and Pretraining

Performance Metric.

In this paper, we focus on the performance of the high-level Planner and regard the low-level Actor as an autonomous agent that uses its pretrained skill sets via a fixed policy. For any latent variable $z \in \mathcal{Z}$ and policy $\pi = \{\pi_h\}_{h \in [H]}$ with $\pi_h: (\mathcal{O} \times \mathcal{G})^{h-1} \times \mathcal{O} \times \Omega \mapsto \Delta(\mathcal{G})$, the value function is defined as

$$\mathcal{J}_z(\pi, \omega) := \mathbb{E}_\pi\left[\sum_{h=1}^H r_h(o_h, \omega)\right], \qquad (3.3)$$

where the expectation is taken with respect to the initial state $s_1 \sim \rho$, the policy $\pi$, the ground-truth translation distribution $\mathbb{O}$, and the transition kernel $\mathbb{P}_z$. For all $(z, \omega) \in \mathcal{Z} \times \Omega$, there exists an optimal policy $\pi_z^*(\omega) = \mathrm{argmax}_{\pi \in \Pi}\, \mathcal{J}_z(\pi, \omega)$, where $\Pi = \{\pi = \{\pi_h\}_{h \in [H]} : \pi_h: (\mathcal{O} \times \mathcal{G})^{h-1} \times \mathcal{O} \times \Omega \mapsto \Delta(\mathcal{G})\}$.

To characterize performance in the practical setting, we denote by $\widehat{\mathcal{J}}_z(\pi, \omega)$ the value function with respect to the pretrained translator $\mathbb{O}_{\widehat{\gamma}}$, and for all $\omega \in \Omega$, let $\widehat{\pi}_z^*(\omega) = \mathrm{argmax}_{\pi \in \Pi}\, \widehat{\mathcal{J}}_z(\pi, \omega)$ be the optimal policy in practice. The regret in the practical setting is then defined as

$$\mathrm{Reg}_z(T) := \sum_{t=1}^T \mathbb{E}_{\mathcal{H}_t}\left[\widehat{\mathcal{J}}_z(\widehat{\pi}_z^*, \omega^t) - \widehat{\mathcal{J}}_z(\widehat{\pi}^t, \omega^t)\right], \qquad (3.4)$$

where $\{\widehat{\pi}^t\}_{t \in [T]}$ represents the Planner's policy empowered by the pretrained $\mathtt{LLM}_{\widehat{\theta}}$, and the expectation is taken with respect to the context $\mathcal{H}_t$ defined in (3.2), generated by executing $\{\widehat{\pi}^i\}_{i < t}$ sequentially. Here, we focus on the performance of the Planner when it collaborates with a pretrained PAR system in an environment characterized by $z$ and a pretrained Reporter. Our goal is to design a sample-efficient algorithm that achieves sublinear regret, i.e., $\mathrm{Reg}_z(T) = o(T)$.
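The regret in (3.4) can be illustrated with a toy evaluator; `J_hat` and the policy labels below are invented stand-ins for the pretrained value function and the Planner's policies.

```python
def regret(J_hat, pi_star, policies, tasks):
    """Reg_z(T) = sum_t [ J_hat(pi*_z, w^t) - J_hat(pi^t, w^t) ],
    given an evaluator J_hat(policy, task) for the value function."""
    return sum(J_hat(pi_star, w) - J_hat(pi, w)
               for pi, w in zip(policies, tasks))

# Toy evaluator: the optimal policy is worth 1.0 per episode, any
# suboptimal policy 0.5.
J_hat = lambda pi, w: 1.0 if pi == "opt" else 0.5

# A learner that locks onto the optimal policy after two episodes
# incurs constant regret, which is sublinear in T.
played = ["explore", "explore"] + ["opt"] * 8
r = regret(J_hat, "opt", played, ["task"] * 10)
# r == 1.0: two episodes with a 0.5 gap each.
```

A policy sequence that never improves would instead accumulate a constant gap every episode, giving the linear regret that Proposition 4.3 attributes to pure BAIL.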

Pretraining Dataset Collection.

The pretraining dataset consists of $N_{\rm p}$ independent samples, each with $T_{\rm p}$ episodes, i.e., $\mathcal{D} = \{D_n\}_{n \in [N_{\rm p}]}$, where $D_n = \{z\} \cup \{\omega^t, \tau_H^t, g_{1:H}^{t,*}, s_{1:H}^t\}_{t \in [T_{\rm p}]}$. For each sample, $z \sim \mathcal{P}_{\mathcal{Z}}$ specifies a low-level MDP with language-conditioned policies, and $\omega^t \sim \mathcal{P}_\Omega$ specifies the sequence of high-level tasks. Here, $\mathcal{P}_{\mathcal{Z}}$ and $\mathcal{P}_\Omega$ denote the prior distributions. We assume that the joint distribution of each data point $D$ in the dataset, denoted by $\mathbb{P}_{\mathcal{D}}$, satisfies

$$\mathbb{P}_{\mathcal{D}}(D) = \mathcal{P}_{\mathcal{Z}}(z) \cdot \prod_{t=1}^{T_{\rm p}} \mathcal{P}_\Omega(\omega^t) \cdot \prod_{h=1}^H \pi^*_{z,h}(g_h^{t,*} \mid \tau_h^t, \omega^t) \cdot \mathbb{O}(o_h^t \mid s_h^t) \cdot \pi^b_h(g_h^t \mid \tau_h^t, \omega^t) \cdot \mathbb{P}_{z,h}(s_{h+1}^t \mid s_h^t, g_h^t), \qquad (3.5)$$

where $\pi^b = \{\pi^b_h\}_{h \in [H]}$ is the behavior policy that characterizes how the contextual information is collected; in addition, the label, i.e., the optimal subgoal, is sampled by experts from the optimal policy $\pi^*_z$. The latent information $z$ is hidden from the context.
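A sketch of how one sample $D_n$ could be drawn under the factorization (3.5). All priors, policies, and kernels below are toy stand-ins; the expert label is recorded at every step while only behavior-policy subgoals enter the context, and $z$ never appears in the context itself.

```python
import random

def collect_sample(Z, Omega, expert, behavior, trans, obs, H, T_p):
    """Draw one pretraining sample D_n in the spirit of Eq. (3.5)."""
    z = random.choice(Z)                         # z ~ P_Z (toy uniform prior)
    episodes = []
    for _ in range(T_p):
        omega = random.choice(Omega)             # omega^t ~ P_Omega
        s, tau, labels = 0, [], []
        for h in range(H):
            o = obs(s)                           # o_h ~ O(. | s_h)
            labels.append(expert(tau, omega, z)) # label g_h^* ~ pi*_z
            g = behavior(tau, omega)             # context g_h ~ pi^b
            tau += [o, g]
            s = trans(s, g, z)                   # s_{h+1} ~ P_{z,h}
        episodes.append((omega, tau, labels))
    return z, episodes

random.seed(0)
z, eps = collect_sample(
    Z=["z0", "z1"], Omega=["w"],
    expert=lambda tau, w, z: "g*|" + z,
    behavior=lambda tau, w: "g",
    trans=lambda s, g, z: s + 1, obs=lambda s: "o" + str(s),
    H=3, T_p=2)
```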

LLM Pretraining.

To pretrain the LLM, we adopt a supervised learning approach for the transformer architecture, in line with celebrated LLMs such as BERT and GPT (Devlin et al., 2018; Brown et al., 2020). Specifically, the pretraining data is constructed from $\mathcal{D}$. For clarity, we extract the language data without expert knowledge and arrange the collected data into a sequence of ordered tokens, i.e., sentences or paragraphs. For the $n$-th sample $D_n$, we write

$$(\ell_{1}^{n},\dots,\ell^{n}_{\bar{T}_{\rm p}}):=\left(\omega^{n,t},o_{1}^{n,t},g_{1}^{n,t},\dots,o_{H-1}^{n,t},g_{H-1}^{n,t},o_{H}^{n,t}\right)_{t\in[T_{\rm p}]}, \tag{3.6}$$

with length $\bar{T}_{\rm p}=2HT_{\rm p}$, which contains $T_{\rm p}$ episodes, each with one task, $H$ observations, and $H-1$ subgoals. Following this, the LLM's pretraining dataset is autoregressively constructed with expert guidance, denoted by $\mathcal{D}_{\mathtt{LLM}}=\{(\tilde{\ell}_{t}^{n},S_{t}^{n})\}_{(n,t)\in[N_{\rm p}]\times[\bar{T}_{\rm p}]}$, where $S_{t+1}^{n}=(S_{t}^{n},\ell^{n}_{t})$ and

$$\tilde{\ell}_{t'}^{n}=\begin{cases} g_{h}^{n,t,*} & \text{if } t'=2H(t-1)+2h+1,\\ \ell_{t'}^{n} & \text{otherwise}.\end{cases}$$

In other words, when pretraining to predict the next subgoal, we replace the label sampled from the behavior policy with the one from the optimal policy. In practice, sentences with expert knowledge can be collected from online knowledge platforms such as Wikipedia (Merity et al., 2016; Reid et al., 2022). Following the pretraining algorithms of BERT and GPT, the objective is to minimize the cross-entropy loss, i.e., $\widehat{\theta}={\rm argmin}_{\theta\in\Theta}\,\mathcal{L}_{\mathrm{CE}}(\theta;\mathcal{D}_{\mathtt{LLM}})$ with

$$\mathcal{L}_{\mathrm{CE}}(\theta;\mathcal{D}_{\mathtt{LLM}}):=\widehat{\mathbb{E}}_{\mathcal{D}_{\mathtt{LLM}}}\left[-\log\mathtt{LLM}_{\theta}(\ell\,|\,S)\right], \tag{3.7}$$

and $\mathtt{LLM}_{\widehat{\theta}}$ is the LLM pretrained by the algorithm in (3.7). More details are deferred to §5.1.
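The relabeling in (3.6) and the loss in (3.7) can be sketched concretely. In this hedged toy version, each episode is a tuple of tokens; the context stream keeps the behavior-policy subgoal $g_h$, while the prediction target at that position is the expert label $g_h^{*}$. The `model` interface below is a hypothetical stand-in for $\mathtt{LLM}_\theta$.

```python
import math

def build_llm_dataset(episodes):
    # Autoregressive (prefix, target) pairs as in (3.6): each episode is
    # (omega, [(o_1, g_1, g_1*), ..., (o_{H-1}, g_{H-1}, g_{H-1}*), (o_H,)]).
    pairs = []
    for omega, steps in episodes:
        stream, targets = [omega], []
        for step in steps:
            targets.append(step[0])            # observation: target is the token itself
            stream.append(step[0])
            if len(step) == 3:
                _, g, g_star = step
                targets.append(g_star)         # subgoal position: expert label g*
                stream.append(g)               # ...but the behavior subgoal stays in context
        for t in range(1, len(stream)):
            pairs.append((tuple(stream[:t]), targets[t - 1]))
    return pairs

def empirical_ce(model, pairs):
    # The loss in (3.7): average negative log-likelihood of the (relabeled)
    # next token; model(prefix) returns a dict of next-token probabilities.
    return -sum(math.log(model(S)[y]) for S, y in pairs) / len(pairs)
```

For a single episode `("w0", [("o1", "g1", "g1*"), ("o2",)])`, the pair at the subgoal position has prefix `("w0", "o1")` and target `"g1*"`, while the next prefix `("w0", "o1", "g1")` retains the behavior subgoal.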

Translator Pretraining.

To pretrain translators, we employ a self-supervised contrastive learning approach, in line with celebrated vision-language models such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021). Let $\mathcal{D}_{\mathtt{Rep}}$ be the contrastive pretraining dataset for translators, also constructed upon the dataset $\mathcal{D}$. Following the framework adopted in Qiu et al. (2022); Zhang et al. (2022), for each observation-state pair $(o,s)\in\mathcal{D}$, a positive or a negative data point, labeled $y=1$ or $y=0$ respectively, is generated with equal probability:

- Positive Data: collect $(o,s)$ with label $y=1$.

- Negative Data: collect $(o^{-},s)$ with label $y=0$, where $o^{-}$ is sampled from a negative sampling distribution $\mathcal{P}^{-}\in\Delta(\mathcal{O})$ with full support over the domain of interest.

Denote by $\mathbb{P}_{\mathcal{C}}$ the joint distribution of the data collected by the process above. The learning algorithm is $\widehat{\gamma}={\rm argmin}_{\gamma\in\Gamma}\,\mathcal{L}_{\mathrm{CT}}(\gamma;\mathcal{D}_{\mathtt{Rep}})$, where the contrastive loss $\mathcal{L}_{\rm CT}(\gamma;\mathcal{D}_{\mathtt{Rep}})$ is defined as

$$\mathcal{L}_{\rm CT}(\gamma;\mathcal{D}_{\mathtt{Rep}}):=\widehat{\mathbb{E}}_{\mathcal{D}_{\mathtt{Rep}}}\left[y\cdot\log\left(1+f_{\gamma}(o,s)^{-1}\right)+(1-y)\cdot\log\left(1+f_{\gamma}(o,s)\right)\right]. \tag{3.8}$$

Consider a function class $\mathcal{F}_{\gamma}\subseteq(\mathcal{S}\times\mathcal{O}\mapsto\mathbb{R})$ with finitely many elements, serving as a set of candidate functions that approximate the ground-truth likelihood ratio $f^{*}(\cdot,\cdot)=\mathbb{O}(\cdot\,|\,\cdot)/\mathcal{P}^{-}(\cdot)$ (see Lemma D.2 for justification). Following this, the pretrained translator for the Reporter obtained by the algorithm in (3.8) is defined as $\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,\cdot)=f_{\widehat{\gamma}}(\cdot,\cdot)\cdot\mathcal{P}^{-}(\cdot)$. More details are deferred to §5.2.
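The data-collection scheme and the loss (3.8) can be sketched directly; this is a minimal toy version in which the positive/negative split and the negative sampling distribution are the only moving parts, and `f` is any candidate likelihood-ratio function.

```python
import math
import random

def make_contrastive_data(pairs, neg_dist, seed=0):
    # Build D_Rep from observation-state pairs: with equal probability keep
    # (o, s) as a positive (y = 1) or replace the observation with a draw
    # from the negative sampling distribution P^- (y = 0).
    rng = random.Random(seed)
    data = []
    for o, s in pairs:
        if rng.random() < 0.5:
            data.append((o, s, 1))
        else:
            o_neg = rng.choices(list(neg_dist), weights=list(neg_dist.values()))[0]
            data.append((o_neg, s, 0))
    return data

def contrastive_loss(f, data):
    # Empirical loss in (3.8) for a candidate likelihood ratio f(o, s).
    return sum(y * math.log(1 + 1 / f(o, s)) + (1 - y) * math.log(1 + f(o, s))
               for o, s, y in data) / len(data)
```

A sanity check: the trivial candidate $f\equiv 1$ incurs loss $\log 2$ on every data point, regardless of the label.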

Remark 3.1 .

In (3.5), we assume that all pretraining data are generated from a joint distribution $\mathbb{P}_{\mathcal{D}}$ and then split for the pretraining of the LLM and the Reporter. In practice, the pretraining dataset for the Reporter can consist of paired observation-state data collected from an arbitrary distribution, as long as (i) the LLM and Reporter "speak" the same language, i.e., share the same $\mathbb{O}$, and (ii) the coverage assumption holds (see Assumption 5.6).

Remark 3.2 .

As an example, noise contrastive estimation (NCE, Gutmann and Hyvärinen, 2010) is one of the most widely adopted objectives in contrastive representation learning. From a theoretical lens, to estimate an unnormalized model $p_{d}$ from $x_{i}\overset{\rm iid}{\sim}p_{d}$, additional noise data are sampled from a reference distribution $p_{n}$, and the estimate is obtained by maximizing $\widehat{\mathbb{E}}[y\cdot\log(h_{\gamma}(x))+(1-y)\cdot\log(1-h_{\gamma}(x))]$ with $y=\mathds{1}(x\text{ is not noise})$ and optimum $h^{*}(x)=p_{d}(x)/(p_{d}(x)+p_{n}(x))$. With slight modifications, we use a function class $\mathcal{F}$ to approximate the ratio $p_{d}/p_{n}$ rather than the relative probability $h$ itself. In practice, the most commonly used contrastive training objectives are variations of NCE originating from the NLP domain (Schiappa et al., 2023), sharing the same idea of minimizing the distance between positive pairs and maximizing the distance between negative pairs.

4 LLM Planning via Bayesian Aggregated Imitation Learning

In this section, we first demonstrate in §4.1 that LLMs can conduct high-level planning through Bayesian aggregated imitation learning (BAIL), leveraging the ICL ability of LLMs with history-dependent prompts. However, depending solely on the LLM's recommendations proves insufficient for achieving sample efficiency in the worst case (see Proposition 4.3). Following this, we propose in §4.2 a planning algorithm for the Planner that leverages LLMs for expert recommendations together with an exploration strategy.

4.1 Bayesian Aggregated Imitation Learning

In this subsection, we show that the LLM conducts high-level task planning via BAIL, integrating both Bayesian model averaging (BMA, Hoeting et al., 1999) during online planning and imitation learning (IL, Ross and Bagnell, 2010) during offline pretraining. Intuitively, pretrained over $\mathcal{D}_{\mathtt{LLM}}$, the LLM approximates the conditional distribution $\mathtt{LLM}(\ell=\cdot\,|\,S)=\mathbb{P}_{\mathcal{D}}(\ell=\cdot\,|\,S)$, where $\mathbb{P}_{\mathcal{D}}$ is the joint distribution in (3.5) and the randomness introduced by the latent variable is aggregated, i.e., $\mathbb{P}_{\mathcal{D}}(\ell=\cdot\,|\,S)=\mathbb{E}_{z\sim\mathbb{P}_{\mathcal{D}}(\cdot|S)}[\mathbb{P}_{\mathcal{D}}(\ell=\cdot\,|\,S,z)]$. Here, $\mathbb{P}_{\mathcal{D}}(\ell=\cdot\,|\,S,z)$ can be viewed as a generating distribution with known $z$, aggregated over the posterior distribution $\mathbb{P}_{\mathcal{D}}(z=\cdot\,|\,S)$, aligning with the form of BMA (Zhang et al., 2023). We temporarily consider the perfect setting.

Definition 4.1 (Perfect Setting) .

We say the PAR system is perfectly pretrained if (i) $\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,s)=\mathbb{O}(\cdot\,|\,s)$ for all $s\in\mathcal{S}$, and (ii) $\mathtt{LLM}_{\widehat{\theta}}(\cdot\,|\,S_{t})=\mathtt{LLM}(\cdot\,|\,S_{t})$ for all $S_{t}=(\ell_{1},\dots,\ell_{t})\in\mathfrak{L}^{*}$ with length $t\leq\bar{T}_{\rm p}$.

The definition states that the Reporter and the LLM report and predict with the ground-truth distributions induced by the joint distribution $\mathbb{P}_{\mathcal{D}}$. During ICL, we invoke the LLM with the history-dependent prompt $\mathtt{pt}_{h}^{t}=\mathcal{H}_{t}\cup\{\omega^{t},\tau_{h}^{t}\}\in\mathfrak{L}^{*}$ for all $(h,t)\in[H]\times[T]$. Conditioned on the latent variable $z$ and $\mathtt{pt}_{h}^{t}$, the generating distribution is the optimal policy, i.e., $\mathbb{P}_{\mathcal{D}}(\cdot\,|\,\mathtt{pt}_{h}^{t},z)=\pi^{*}_{z,h}(\cdot\,|\,\tau_{h}^{t},\omega^{t})$, which is independent of the history $\mathcal{H}_{t}$. In this sense, LLMs imitate expert policies during pretraining. The proposition below shows that LLMs conduct task planning via BAIL.

Proposition 4.2 (LLM Performs BAIL) .

Assume that the pretraining data distribution is given by (3.5). Under the perfect setting in Definition 4.1, for all $(h,t)\in[H]\times[T]$, the LLM conducts task planning via BAIL, following that

$$\pi_{h,\mathtt{LLM}}^{t}\left(\cdot\,|\,\tau_{h}^{t},\omega^{t}\right)=\sum_{z\in\mathcal{Z}}\pi^{*}_{z,h}\left(\cdot\,|\,\tau_{h}^{t},\omega^{t}\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,\mathtt{pt}_{h}^{t}\right),$$

where $\pi_{h,\mathtt{LLM}}^{t}$ denotes the LLM's policy and the prompt is defined in (3.2).

Proof of Proposition 4.2.

Please refer to § C.1 for a detailed proof. ∎

Proposition 4.2 suggests that LLMs provide recommendations via a two-fold procedure: first, the LLM computes the posterior belief of each latent variable $z\in\mathcal{Z}$ from $\mathtt{pt}_{h}^{t}$; second, it aggregates the optimal policies over the posterior probabilities and provides recommendations.
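This two-fold procedure can be sketched as a direct computation. In the toy version below, the per-latent history log-likelihoods are supplied precomputed (a stand-in for $\log\mathbb{P}_{\mathcal{D}}(\mathtt{pt}\,|\,z)$), and `experts` maps each latent to its optimal policy; these interfaces are illustrative assumptions.

```python
import math

def bail_policy(prior, loglik, experts, prompt):
    # Proposition 4.2 as a computation: the LLM's subgoal distribution is the
    # posterior-weighted mixture of latent-indexed expert policies.
    #   prior[z]           : P_Z(z)
    #   loglik[z]          : log-likelihood of the prompt history under z
    #   experts[z](prompt) : dict mapping subgoal -> pi*_{z}(g | prompt)
    weights = {z: prior[z] * math.exp(loglik[z]) for z in prior}
    total = sum(weights.values())
    posterior = {z: w / total for z, w in weights.items()}    # P_D(z | pt)
    mixture = {}
    for z, p in posterior.items():
        for g, q in experts[z](prompt).items():
            mixture[g] = mixture.get(g, 0.0) + p * q          # aggregate experts
    return mixture, posterior
```

With two deterministic experts and a history that favors $z=0$, the mixture places correspondingly more mass on expert 0's subgoal.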

4.2 LLM-Empowered Planning Algorithm

Algorithm 1 Planning with PAR System - Planner
1: Input: exploration policy $\pi_{\mathtt{exp}}$ with $\eta\in(0,1)$, $c_{\mathcal{Z}}>0$, and $|\mathcal{Z}|\in\mathbb{N}$.
2: Initialize $\mathcal{H}_{0}\leftarrow\{\}$ and $\epsilon\leftarrow(H\log(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T})/T\eta)^{1/2}$.
3: for episode $t$ from $1$ to $T$ do
4:   Receive the high-level task $\omega^{t}$ from the human user.
5:   Sample $\mathcal{I}_{t}\sim\mathrm{Bernoulli}(\epsilon)$.
6:   for step $h$ from $1$ to $H$ do
7:     Collect the observation $o_{h}^{t}$ from the Reporter.
8:     Set $\mathtt{pt}_{h}^{t}\leftarrow\mathcal{H}_{t}\cup\{\omega^{t},o_{1}^{t},\dots,o_{h}^{t}\}$.
9:     Sample $g_{h,\mathtt{LLM}}^{t}\sim\mathtt{LLM}(\cdot\,|\,\mathtt{pt}_{h}^{t})$ via prompting the LLM.
10:    If $\mathcal{I}_{t}=1$ then sample $g_{h}^{t}\sim\pi_{h,\mathtt{exp}}(\cdot\,|\,\tau_{h}^{t})$, else $g_{h}^{t}\leftarrow g_{h,\mathtt{LLM}}^{t}$.
11:    Send the subgoal $g_{h}^{t}$ to the Actor.
12:  end for
13:  Update $\mathcal{H}_{t+1}\leftarrow\mathcal{H}_{t}\cup\{\omega^{t},\tau_{H}^{t}\}$.
14: end for

Following the arguments above, we propose a planning algorithm for the Planner within a perfect PAR system. From a high level, the process of task planning is an implementation of policies from imitation learning (Ross and Bagnell,, 2010 ; Ross et al.,, 2011 ) with two key distinctions: (i) Planner collaborates with LLM, a “nascent” expert that learns the hidden intricacies of the external world from updating prompts; (ii) different from behavior cloning or inverse RL, Planner does not aim to comprehend LLM’s behaviors. Instead, the imitation is accomplished during the offline pretraining, and Planner shall selectively adhere to LLM’s suggestions during online planning. Next, we show that task planning solely guided by LLMs fails to achieve sample efficiency in the worst case.

Proposition 4.3 (Hard-to-Distinguish Example) .

Suppose that Definition 4.1 holds. For any $T\in\mathbb{N}$, there exists an HMDP and a specific latent variable $z\in\mathcal{Z}$ such that if the Planner strictly follows the LLM's recommended policies in Proposition 4.2, it holds that ${\rm Reg}_{z}(T)\geq 0.5T\cdot(1-1/|\mathcal{Z}|)^{T}$.

Proof of Proposition 4.3.

Please refer to § C.4 for a detailed proof. ∎

Proposition 4.3 indicates that relying solely on LLMs for task planning can result in a suboptimal $\Omega(T)$ regret in the worst case when $|\mathcal{Z}|=T$. Thus, additional exploration is essential to discern the latent information about the external world, paralleling practical implementations in latent imitation learning (Edwards et al., 2019; Kidambi et al., 2021) and LLM-based reasoning (Hao et al., 2023; Nottingham et al., 2023). In practice, while the language model can offer guidance toward achieving a goal, this guidance is not grounded in real-world observations. Thus, as pointed out by Grigsby et al. (2023), the information provided in narratives might be arbitrarily wrong, which highlights the need for exploration to navigate new environments effectively. Similar to $\epsilon$-greedy algorithms (Tokic and Palm, 2011; Dann et al., 2022), we provide a simple but efficient algorithm for LLM-empowered task planning; Algorithm 1 gives the pseudocode. In each episode, the Planner performs two main steps:

- Policy Decision ($\mathtt{Line\ 5}$): Randomly decide, with probability $\epsilon$, whether to execute the exploration policy $\pi_{\mathtt{exp}}$ or to follow the LLM's recommendations within this episode.

- Planning with LLMs ($\mathtt{Lines\ 7\text{-}10}$): If the Planner decides to follow the LLM's recommendations, the subgoal is obtained by prompting the LLM with $\mathtt{pt}_{h}^{t}=\mathcal{H}_{t}\cup\{\omega^{t},\tau_{h}^{t}\}$, i.e., sampling from $\mathtt{LLM}(\cdot\,|\,\mathtt{pt}_{h}^{t})$. Otherwise, the Planner takes the subgoal from $\pi_{h,\mathtt{exp}}(\cdot\,|\,\tau_{h}^{t})$.
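The episode loop can be sketched end-to-end as follows. The environment interface (`new_task`/`observe`/`step`), standing in for the human user, the Reporter, and the Actor, as well as the two policy callables, are hypothetical interfaces for illustration only.

```python
import random

def plan_with_par(T, H, eps, llm_policy, explore_policy, env, seed=0):
    # Minimal sketch of epsilon-greedy planning with the PAR system:
    # once per episode, commit to exploring (prob. eps) or following the LLM.
    rng = random.Random(seed)
    history = []
    for t in range(T):
        omega = env.new_task()                     # receive the task omega^t
        explore = rng.random() < eps               # I_t ~ Bernoulli(eps)
        tau = [omega]
        for h in range(H):
            o = env.observe()                      # o_h^t from the Reporter
            tau.append(o)
            prompt = (tuple(history), tuple(tau))  # pt_h^t = H_t and {omega^t, tau_h^t}
            g = explore_policy(tau) if explore else llm_policy(prompt)
            env.step(g)                            # send subgoal g_h^t to the Actor
            tau.append(g)
        history.append(tuple(tau))                 # H_{t+1} = H_t and {omega^t, tau_H^t}
    return history
```

Each returned trajectory contains the task plus $H$ observation-subgoal pairs, matching the prompt construction used during pretraining.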

In conventional $\epsilon$-greedy algorithms, exploration is taken uniformly over the action space $\mathcal{G}$, i.e., $\pi_{\mathtt{exp}}={\rm Unif}_{\mathcal{G}}$. Recent work has extended this to a collection of distributions (e.g., softmax, Gaussian noise) for function approximation (Dann et al., 2022). Following this, we instead consider a broader class of exploration strategies that satisfy the $\eta$-distinguishability property below.

Definition 4.4 ( η 𝜂 \eta -distinguishability) .

We say an exploration policy $\pi_{\mathtt{exp}}=\{\pi_{h,\mathtt{exp}}\}_{h\in[H]}$ is $\eta$-distinguishable if there exists an absolute constant $\eta>0$ such that for all $z,z'\in\mathcal{Z}$ with $z\neq z'$, it holds that $D_{\rm H}^{2}\left(\mathbb{P}^{\pi_{\mathtt{exp}}}_{z}(\tau_{H}),\mathbb{P}^{\pi_{\mathtt{exp}}}_{z'}(\tau_{H})\right)\geq\eta$.

The $\eta$-distinguishability implies the existence of an exploration policy $\pi_{\mathtt{exp}}$ that can distinguish any two models with an $\eta$-gap in Hellinger distance over the distribution of the whole trajectory, which also imposes a condition on model separation. Next, we introduce the assumption on the prior.
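As a concrete check of Definition 4.4, the squared Hellinger distance between two discrete trajectory distributions can be computed directly; an exploration policy is $\eta$-distinguishable when this quantity is at least $\eta$ for every pair of distinct latents. The toy distributions below are illustrative.

```python
import math

def sq_hellinger(p, q):
    # Squared Hellinger distance between two discrete distributions over
    # trajectories: D_H^2(p, q) = 1 - sum_x sqrt(p(x) * q(x)).
    support = set(p) | set(q)
    return 1.0 - sum(math.sqrt(p.get(x, 0.0) * q.get(x, 0.0)) for x in support)
```

Identical distributions give distance 0, disjoint supports give 1, and any separation in between quantifies how informative a single explored trajectory is about the latent.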

Assumption 4.5 (Prior coverage) .

There exists a constant $c_{\mathcal{Z}}>0$ such that $\sup_{z,z'}\frac{\mathcal{P}_{\mathcal{Z}}(z')}{\mathcal{P}_{\mathcal{Z}}(z)}\leq c_{\mathcal{Z}}$.

The assumption asserts a bounded ratio of priors, implying that each $z\in\mathcal{Z}$ has a non-negligible prior probability. The assumption is intuitive, as a negligible prior suggests such a scenario almost surely does not occur, rendering planning for it unnecessary. We are now ready to present the main theorem for the Planner under the perfect setting.

Theorem 4.6 (Regret under Perfect Setting) .

Suppose that Definition 4.1 and Assumption 4.5 hold. Given an $\eta$-distinguishable exploration policy $\pi_{\mathtt{exp}}$ and $T\leq T_{\rm p}$, Algorithm 1 ensures

$${\rm Reg}_{z}(T):=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\mathcal{J}_{z}(\pi_{z}^{*},\omega^{t})-\mathcal{J}_{z}(\pi^{t},\omega^{t})\right]\leq\tilde{\mathcal{O}}\left(H^{\frac{3}{2}}\sqrt{T/\eta\cdot\log(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T})}\right),$$

for any $z\in\mathcal{Z}$ and $\{\omega^{t}\}_{t\in[T]}$, if the Planner explores with probability $\epsilon=(H\log(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T})/T\eta)^{1/2}$.

Proof of Theorem 4.6.

Please refer to § C.2 for a detailed proof. ∎

Theorem 4.6 states that the Planner's algorithm attains an $\tilde{\mathcal{O}}(\sqrt{T})$ regret for planning facilitated by LLMs. The multiplicative factor of the regret depends on the horizon $H$ of the interactive process, the reciprocal of the distinguishability gap $\eta$ in Definition 4.4, and the logarithmic term $\log(c_{\mathcal{Z}}|\mathcal{Z}|)$ involving both the cardinality of candidate models and the prior coverage in Assumption 4.5, which jointly characterize the complexity of the physical world.

Remark 4.7 .

Lee et al. (2023) demonstrated that a perfect decision-pretrained transformer, playing a role similar to the LLM in our framework, can attain an $\tilde{\mathcal{O}}(H^{\frac{3}{2}}\sqrt{T})$ Bayesian regret, i.e., a bound on $\mathbb{E}_{z\sim\mathcal{P}_{\mathcal{Z}}}[{\rm Reg}(T)]$, via ICL. In comparison, we focus on the more challenging goal of controlling the frequentist regret, which is closer to applications, and attain a comparable result with additional exploration.

5 Performance under Practical Setting

5.1 Pretraining Large Language Model

In this subsection, we elaborate on the pretraining of LLMs using the transformer architecture. We employ a supervised learning algorithm minimizing the cross-entropy loss, i.e., $\widehat{\theta}={\rm argmin}_{\theta\in\Theta}\,\mathcal{L}_{\mathrm{CE}}(\theta;\mathcal{D}_{\mathtt{LLM}})$, as detailed in (3.7). The population risk then follows

$$\mathcal{R}_{\mathrm{CE}}(\theta;\mathcal{D}_{\mathtt{LLM}})=\mathbb{E}_{t}\left[\mathbb{E}_{S_{t}}\left[D_{\rm KL}\left(\mathtt{LLM}(\cdot\,|\,S_{t})\,\|\,\mathtt{LLM}_{\theta}(\cdot\,|\,S_{t})\right)+{\rm Ent}\left(\mathtt{LLM}(\cdot\,|\,S_{t})\right)\right]\right],$$

where $t\sim{\rm Unif}([\bar{T}_{\rm p}])$, $S_{t}$ is distributed as the pretraining distribution, and ${\rm Ent}(\mathbb{P})=-\mathbb{E}_{x\sim\mathbb{P}}[\log\mathbb{P}(x)]$ is the Shannon entropy. As the minimum is achieved at $\mathtt{LLM}_{\theta}(\cdot\,|\,S)=\mathtt{LLM}(\cdot\,|\,S)$, the estimated $\mathtt{LLM}_{\widehat{\theta}}$ and $\mathtt{LLM}$ are expected to converge under the algorithm with a sufficiently large dataset. Our design adopts a transformer function class to stay consistent with the architectural choices of language models like BERT and GPT. Specifically, a transformer model comprises $D$ sub-modules, each incorporating a Multi-Head Attention (MHA) mechanism and a fully connected Feed-Forward (FF) layer. See §A.2 for further details; below we specify two widely adopted assumptions in the theoretical analysis of LLM pretraining (Wies et al., 2023; Zhang et al., 2023).
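The risk decomposition above, cross-entropy equals KL divergence plus entropy of the target distribution, is a generic identity for discrete distributions and can be verified numerically on a toy example (this check is not specific to the paper's setting).

```python
import math

def cross_entropy_risk(p, q):
    # Population cross-entropy E_{x~p}[-log q(x)] for discrete p, q.
    return -sum(p[x] * math.log(q[x]) for x in p)

def kl(p, q):
    # KL divergence D_KL(p || q).
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)

def entropy(p):
    # Shannon entropy with the standard sign, Ent(p) = -E_{x~p}[log p(x)].
    return -sum(p[x] * math.log(p[x]) for x in p)
```

Since the entropy term does not depend on $\theta$, minimizing the cross-entropy risk is equivalent to minimizing the KL term, whose minimum of zero is attained when the model matches the target distribution.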

Assumption 5.1 (Boundedness) .

For all $z\in\mathcal{Z}$ and $t\leq\bar{T}_{\rm p}$, there exists a constant $R>0$ such that every $S_{t}=(\ell_{1},\dots,\ell_{t})\sim\mathbb{P}_{\mathcal{D}}(\cdot\,|\,z)$ with $S_{t}\in\mathfrak{L}^{*}$ satisfies $\|S_{t}\|_{2,\infty}\leq R$ almost surely.

The boundedness assumption requires that the $\ell_{2}$-norm of each token is upper bounded by $R>0$, which holds in most settings.

Assumption 5.2 (Ambiguity) .

For every latent variable $z\in\mathcal{Z}$, there exists a constant $c_{0}>0$ such that for all $\ell_{t+1}\in\mathfrak{L}$ and $S_{t}=(\ell_{1},\dots,\ell_{t})\in\mathfrak{L}^{*}$ with length $t<\bar{T}_{\rm p}$, it holds that $\mathbb{P}_{\mathcal{D}}(\ell_{t+1}\,|\,S_{t},z)\geq c_{0}$.

The ambiguity assumption states that the generating distribution is lower bounded; this is reasonable since there are typically multiple plausible choices of subsequent words that convey the same meaning. Next, we present the performance of the pretrained LLM.

Theorem 5.3 ( Zhang et al., ( 2023 ) ) .

Suppose that Assumptions 5.1 and 5.2 hold. With probability at least $1-\delta$, the pretrained model $\mathtt{LLM}_{\widehat{\theta}}$ obtained by the algorithm in (3.7) satisfies

$$\begin{aligned}
&\bar{\mathbb{E}}_{\mathcal{D}_{\mathtt{LLM}}}\left[D_{\rm TV}\left(\mathtt{LLM}(\cdot\,|\,S),\mathtt{LLM}_{\widehat{\theta}}(\cdot\,|\,S)\right)\right]\\
&\quad\leq\mathcal{O}\Biggl(\inf_{\theta^{*}\in\Theta}\sqrt{\bar{\mathbb{E}}_{\mathcal{D}_{\mathtt{LLM}}}\left[D_{\mathrm{KL}}\left(\mathtt{LLM}(\cdot\,|\,S)\,\|\,\mathtt{LLM}_{\theta^{*}}(\cdot\,|\,S)\right)\right]}\\
&\qquad+\frac{t_{\rm mix}^{1/4}\log\frac{1}{\delta}}{(N_{\mathrm{p}}\bar{T}_{\mathrm{p}})^{1/4}}+\sqrt{\frac{t_{\rm mix}}{N_{\mathrm{p}}\bar{T}_{\mathrm{p}}}}\Bigl(\bar{D}\log\bigl(1+\bar{B}N_{\mathrm{p}}\bar{T}_{\mathrm{p}}\bigr)+\log\frac{1}{\delta}\Bigr)\Biggr),
\end{aligned}$$

where $\bar{B}$ and $\bar{D}$ feature the transformer's architecture, $t_{\rm mix}$ denotes the mixing time of the Markov chain $\{S_{t}\}_{t\in[T]}$ (note that $\{S^{n}_{t}\}_{t\in[T]}$ directly satisfies the Markov property since $S_{t}^{n}=(\ell_{1}^{n},\dots,\ell_{t}^{n})$ and thus $S^{n}_{i}\subseteq S^{n}_{t}$ for all $i\leq t$), and $N_{\mathrm{p}}\bar{T}_{\mathrm{p}}$ is the size of the dataset $\mathcal{D}_{\mathtt{LLM}}$. See §A.2 for the detailed structure and definitions.

Proof of Theorem 5.3.

Please refer to Theorem 5.3 in Zhang et al., ( 2023 ) for a detailed proof. ∎

Theorem 5.3 states that the total variation of the conditional distribution, with expectation taken over the average distribution of the context $S$ in $\mathcal{D}_{\mathtt{LLM}}$ (see Table 1 for the definition), converges at rate $\mathcal{O}\left((N_{\mathrm{p}}\bar{T}_{\mathrm{p}})^{-1/2}\right)$. Note that the first two terms represent the approximation error; deep neural networks act as universal approximators (Yarotsky, 2017), so this error vanishes with increasing network capacity (Proposition C.4, Zhang et al., 2023). For notational simplicity, we denote the right-hand side of the theorem by $\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta)$.

5.2 Pretraining Observation-to-Language Translator

In this subsection, we focus on the pretraining of observation-to-language translators under a self-supervised learning architecture using the contrastive loss. Consider the function class

$$\mathcal{F}_{\gamma}=\left\{f_{\gamma}(\cdot,\cdot):\gamma\in\Gamma,\ \|f_{\gamma}\|_{\infty}\leq B_{\mathcal{F}},\ \|1/f_{\gamma}\|_{\infty}\leq B^{-}_{\mathcal{F}}\right\},$$

with finitely many elements; the contrastive loss $\mathcal{L}_{\rm CT}(\gamma;\mathcal{D}_{\mathtt{Rep}})$ in (3.8) is then defined over $\mathcal{F}_{\gamma}$. Note that the contrastive loss can be equivalently written as the negative log-likelihood loss of a binary discriminator, i.e., $\mathcal{L}_{\rm CT}(\gamma;\mathcal{D}_{\mathtt{Rep}})=\widehat{\mathbb{E}}_{\mathcal{D}_{\mathtt{Rep}}}\left[-\log\mathbb{D}_{\gamma}(y\,|\,o,s)\right]$, where we define

$$\mathbb{D}_{\gamma}(y\,|\,o,s):=\left(\frac{f_{\gamma}(o,s)}{1+f_{\gamma}(o,s)}\right)^{y}\left(\frac{1}{1+f_{\gamma}(o,s)}\right)^{1-y}. \tag{5.1}$$

Based on (5.1) and the algorithm $\widehat{\gamma}={\rm argmin}_{\gamma\in\Gamma}\,\mathcal{L}_{\mathrm{CT}}(\gamma;\mathcal{D}_{\mathtt{Rep}})$, the population risk follows

$$\mathcal{R}_{\rm CT}(\gamma;\mathcal{D}_{\mathtt{Rep}})=\mathbb{E}\left[D_{\rm KL}\left(\mathbb{D}(\cdot\,|\,o,s)\,\|\,\mathbb{D}_{\gamma}(\cdot\,|\,o,s)\right)+{\rm Ent}\left(\mathbb{D}(\cdot\,|\,o,s)\right)\right]. \tag{5.2}$$

As the minimum is attained at $\mathbb{D}_{\gamma}(\cdot\,|\,o,s)=\mathbb{D}(\cdot\,|\,o,s)$, where $\mathbb{D}(\cdot\,|\,o,s):=\mathbb{P}_{\mathcal{C}}(\cdot\,|\,o,s)$ is the distribution of the label conditioned on the $(o,s)$ pair in the contrastive data collection, the estimated $\mathbb{D}_{\widehat{\gamma}}(\cdot\,|\,o,s)$ and $\mathbb{D}(\cdot\,|\,o,s)$ are expected to converge, and thus the learning target is the ground-truth likelihood ratio $f^{*}(o,s)=\mathbb{O}(o\,|\,s)/\mathcal{P}^{-}(o)$ (see Lemma D.2). Below, we assume that the learning target $f^{*}(o,s)$ is realizable in $\mathcal{F}_{\gamma}$, which is standard in the literature (Qiu et al., 2022).
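That the population risk is minimized at the ground-truth ratio can be checked numerically on a single $(o,s)$ cell. The sketch below assumes, as in the data-collection process, that positives and negatives are drawn equally often, so the true label distribution is $\mathbb{P}_{\mathcal{C}}(y=1\,|\,o,s)=\mathbb{O}(o|s)/(\mathbb{O}(o|s)+\mathcal{P}^{-}(o))$; the specific numbers are illustrative.

```python
import math

def discriminator_risk(f_val, p_pos, p_neg):
    # Population negative log-likelihood of the discriminator (5.1) at one
    # (o, s) cell, where p_pos = O(o|s) and p_neg = P^-(o). The discriminator
    # predicts y = 1 with probability f / (1 + f).
    w1 = p_pos / (p_pos + p_neg)   # true label probability P_C(y=1 | o, s)
    d1 = f_val / (1 + f_val)       # model's predicted label probability
    return -(w1 * math.log(d1) + (1 - w1) * math.log(1 - d1))

# Grid search over candidate ratio values: the risk is minimized at the
# ground-truth ratio f* = O(o|s) / P^-(o) (here 0.6 / 0.2 = 3.0).
candidates = [i / 10 for i in range(1, 81)]
best = min(candidates, key=lambda f: discriminator_risk(f, p_pos=0.6, p_neg=0.2))
```

Setting the derivative of the risk to zero gives $f=w_1/(1-w_1)=p_{\rm pos}/p_{\rm neg}$, matching the grid-search minimizer.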

Assumption 5.4 (Realizability) .

Given a designated negative sampling distribution $\mathcal{P}^{-}\in\Delta(\mathcal{O})$, there exists $\gamma^{*}\in\Gamma$ such that $f_{\gamma^{*}}(o,s) = \mathbb{O}(o \mid s)/\mathcal{P}^{-}(o)$ for all $(o,s)\in\mathcal{O}\times\mathcal{S}$.

Next we present the performance of the pretrained translator.

Theorem 5.5 (Pretrained Translator) .

Suppose that Assumption 5.4 holds. With probability at least $1-\delta$, the model $\mathbb{O}_{\widehat{\gamma}}$ pretrained via the objective in (5.1) satisfies

\[
\bar{\mathbb{E}}_{\mathcal{D}_{\mathtt{Rep}}}\left[D_{\rm TV}\big(\mathbb{O}(\cdot \mid s),\, \mathbb{O}_{\widehat{\gamma}}(\cdot \mid s)\big)\right] \leq \mathcal{O}\left(\frac{B_{\mathcal{F}}(B^{-}_{\mathcal{F}})^{1/2}}{(N_{\rm p}T_{\rm p}H)^{1/2}}\sqrt{\log\big(N_{\rm p}T_{\rm p}H|\mathcal{F}_{\gamma}|/\delta\big)}\right),
\]

where $\mathbb{O}_{\widehat{\gamma}}(\cdot \mid s) = f_{\widehat{\gamma}}(\cdot \mid s)\cdot\mathcal{P}^{-}(\cdot)$ and $|\mathcal{F}_{\gamma}|$ denotes the cardinality of the function class $\mathcal{F}_{\gamma}$.

Proof of Theorem 5.5.

Please refer to § D.1 for a detailed proof. ∎

Theorem 5.5 posits that the average expected total variation of the translation distribution with respect to $\mathcal{D}_{\mathtt{Rep}}$ converges at rate $\mathcal{O}\big((N_{\rm p}T_{\rm p})^{-1/2}\big)$. For notational simplicity, we write the right-hand side of the theorem as $\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)$. Furthermore, the algorithm also ensures a more stringent convergence guarantee in $\chi^{2}$-divergence: $\bar{\mathbb{E}}_{\mathcal{D}_{\mathtt{Rep}}}\big[\chi^{2}\big(\mathbb{O}_{\widehat{\gamma}}(\cdot \mid s)\,\|\,\mathbb{O}(\cdot \mid s)\big)\big] \leq \Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}$.

5.3 Performance with Pretrained PAR System

In this subsection, we delve into the performance of task planning with the pretrained PAR system. We first introduce an online coverage assumption, which relates the distribution of online planning trajectories in practical scenarios to the trajectories in the pretraining datasets.

Assumption 5.6 (Coverage) .

There exist absolute constants $\lambda_{S} > 0$ and $\lambda_{R} > 0$ such that for all latent variables $z\in\mathcal{Z}$, $t < \bar{T}_{\rm p}$, and policy sequences $\{\pi^{i}\}_{i\leq\lceil t/2H\rceil}$ from the Planner, it holds that (i) $\prod_{i=1}^{\lceil t/2H\rceil}\widehat{\mathbb{P}}_{z}^{\pi_{i}}(\tilde{S}_{i}) \leq \lambda_{S}\cdot\bar{\mathbb{P}}_{\mathcal{D}_{\mathtt{LLM}}}(S_{t})$ for every ordered sequence $S_{t} = (\tilde{S}_{i})_{i\leq\lceil t/2H\rceil}\in\mathfrak{L}^{*}$ with $|\tilde{S}_{i}| = 2H$ for all $i < \lceil t/2H\rceil$, and (ii) $\bar{\mathbb{P}}_{\mathcal{D}_{\mathtt{Rep}}}(s) \geq \lambda_{R}$ for all states $s\in\mathcal{S}$.

Here, $\widehat{\mathbb{P}}_{z}$ denotes the distribution of the dynamical system with the pretrained translator. The assumption asserts that (i) the distribution of the ICL prompts induced by policy sequences $\{\pi^{i}\}_{i\leq\lceil t/2H\rceil}$ from the Planner in practical scenarios is covered by the pretraining data, where $\lceil t/2H\rceil$ denotes the number of episodes described in $S_{t}$, and (ii) all states $s\in\mathcal{S}$ are covered by the average distribution of the Reporter's pretraining dataset. Similar conditions are adopted in ICL analysis (Zhang et al., 2023), decision-pretrained transformers (Lee et al., 2023; Lin et al., 2023b), and offline RL (Munos, 2005; Duan et al., 2020). Intuitively, the LLM and the Reporter cannot precisely plan or translate beyond the support of the pretraining dataset. These conditions are achievable if an explorative behavior strategy $\pi^{b}$ is deployed with a sufficiently large $N_{\rm p}$ when collecting data. We then present the main theorem on the practical performance.

Theorem 5.7 (Regret under Practical Setting) .

Suppose that Assumptions 4.5, 5.1, 5.2, 5.4, and 5.6 hold. Given an $\eta$-distinguishable exploration policy $\pi_{\mathtt{exp}}$ and $T\leq T_{\rm p}$, under the practical setting, the Planner's algorithm in Algorithm 1 ensures that

\[
{\rm Reg}_{z}(T) \leq \tilde{\mathcal{O}}\Big(H^{\frac{3}{2}}\sqrt{T/\eta\cdot\log\big(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T}\big)} + H^{2}T\cdot\Delta_{\rm p}\big(N_{\rm p},T_{\rm p},H,1/\sqrt{T},\xi\big)\Big),
\]

for any $z\in\mathcal{Z}$ and $\{\omega_{t}\}_{t\in[T]}$. The cumulative pretraining error of the PAR system satisfies

\begin{align*}
\Delta_{\rm p}(N_{\rm p},T_{\rm p},H,\delta,\xi) ={} & (\eta\lambda_{R})^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2} \\
& + 2\lambda_{R}^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta) + \lambda_{S}\cdot\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta),
\end{align*}

where $\xi = (\eta,\lambda_{S},\lambda_{R})$ is defined in Definition 4.4 and Assumption 5.6, and the pretraining errors $\Delta_{\mathtt{LLM}}$ and $\Delta_{\mathtt{Rep}}$ are defined in Theorem 2 and Theorem 5.5. Under the practical setting, the Planner should explore with probability $\epsilon = \big(H\log\big(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T}\big)/T\eta\big)^{1/2} + H(\eta\lambda_{\min})^{-1}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,1/\sqrt{T})^{2}$.

Proof of Theorem 5.7.

Please refer to § D.2 for a detailed proof. ∎

Theorem 5.7 reveals that, compared with the perfect setting, the Planner can still achieve an approximate $\tilde{\mathcal{O}}(\sqrt{T})$ regret, but with an additional pretraining error term that diminishes as the volume of pretraining data increases. Besides, it further underscores the necessity of exploration: the Planner should explore with an additional probability $H(\eta\lambda_{\min})^{-1}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}$ to handle the mismatch between the ground-truth and the pretrained environment.
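The exploration probability prescribed by Theorem 5.7 can be sketched numerically. The following minimal Python function is our illustration, not the paper's code; all constants ($\eta$, $\lambda_{\min}$, $c_{\mathcal{Z}}$, $|\mathcal{Z}|$, $\Delta_{\mathtt{Rep}}$) are assumed known and the names are ours.

```python
import math

def exploration_prob(T, H, eta, lam_min, n_latent, c_z, delta_rep):
    """epsilon-greedy rate from Theorem 5.7 (practical setting):
    the first term is the perfect-setting BAIL schedule, the second
    accounts for the Reporter's pretraining error Delta_Rep."""
    bail_term = math.sqrt(H * math.log(c_z * n_latent * math.sqrt(T)) / (T * eta))
    pretrain_term = H * delta_rep ** 2 / (eta * lam_min)
    return min(1.0, bail_term + pretrain_term)

# With zero pretraining error the rate decays as O(sqrt(log(T) / T)),
# recovering the perfect-setting exploration schedule.
eps_small = exploration_prob(T=10_000, H=5, eta=0.5, lam_min=0.1,
                             n_latent=8, c_z=2.0, delta_rep=0.0)
```

Note that the pretraining term does not vanish with $T$, matching the discussion above: a fixed model mismatch forces a persistent level of exploration.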

Remark 5.8 .

The challenge of establishing a performance guarantee in the practical setting arises from the mismatch between the ground-truth environment and the pretrained one, which leads to a distributional shift in the posterior probability. Besides, BAIL is realized through a pretrained LLM, which introduces an additional pretraining error. In response, we propose a novel regret decomposition and provide the convergence rate of the posterior probability under bounded pretraining errors, distinguishing our results from those in Lee et al. (2023) and Liu et al. (2023).

Extensions.

We also present two extensions. In § B.1, we discuss the design of the Planner by taking LLMs as a World Model (WM). Here, the Planner prompts the LLM to predict the next observation rather than subgoals, alleviating the reliance on expert knowledge. By leveraging model-based RL methods such as Monte Carlo Tree Search (MCTS) and Real-Time Dynamic Programming (RTDP), the Planner utilizes the LLM-simulated environment to optimize its strategies based on the contextual information. As shown in Proposition B.1, the world model simulated via ICL conforms to a Bayesian Aggregated World Model (BAWM). Hence, the LLM Planner achieves a regret of ${\rm Reg}_{z}(T) \leq \tilde{\mathcal{O}}(H\sqrt{T/\eta}) + H^{2}T\cdot\Delta_{\rm p,wm}$ under the practical setting with regularity conditions (see Corollary B.3). Besides, we extend the results in § 4 to accommodate the scenario of multi-agent collaboration, i.e., $K$ Actors. In § B.2, we formulate the problem as a cooperative hierarchical Markov Game (HMG) and establish a theoretical guarantee of ${\rm Reg}_{z}(T) \leq \tilde{\mathcal{O}}(\sqrt{H^{3}TK/\eta})$ under the perfect setting (see Corollary B.4). These two extensions correspond to recent works on LLM planning with world models (e.g., Hu and Shu, 2023) and multi-agent collaboration of LLM Agents (e.g., Mandi et al., 2023).

6 Conclusion

In this work, we embedded the LLM-empowered decision-making problem into a hierarchical RL framework named the PAR system, where at the high level, the LLM Planner decomposes the user-specified task into subgoals, and at the low level, the Actor(s) translate the linguistic subgoals into physical realizations while also providing feedback to augment the planning process through a trained Reporter. Under the perfect setting, we characterized the BAIL nature of the LLM-aided planning pipeline and the necessity of exploration even under expert guidance. We also shed light on how the pretraining errors of both the LLM and the Reporter enter the ICL error under practical scenarios.

References

Appendix for “From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems”

Appendix A Additional Background

In this appendix, we present the additional background knowledge that is omitted from the main text due to the space limit. We first lay out the notation used in this paper.

Notations.

For $n\in\mathbb{N}^{+}$, let $[n] = \{1,\dots,n\}$. Denote by $\Delta(\mathcal{X})$ the probability simplex over $\mathcal{X}$. Given two non-negative sequences $\{a_{n}\}_{n\geq 0}$ and $\{b_{n}\}_{n\geq 0}$, if $\limsup a_{n}/b_{n} < \infty$, we write $a_{n} = \mathcal{O}(b_{n})$, and we use $\tilde{\mathcal{O}}$ to further omit logarithmic terms. If $\liminf a_{n}/b_{n} > 0$, we write $a_{n} = \Omega(b_{n})$. For a set $\mathcal{S}$, denote by $|\mathcal{S}|$ its cardinality. For a matrix $X\in\mathbb{R}^{m\times n}$, the $\ell_{p,q}$-norm is defined as $\|X\|_{p,q} = (\sum_{i=1}^{n}\|X_{:,i}\|_{p}^{q})^{1/q}$, where $X_{:,i}$ denotes the $i$-th column of $X$.
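The $\ell_{p,q}$-norm above can be computed in a couple of lines; the following numpy sketch (our illustration, with hypothetical names) follows the definition exactly: take the $p$-norm of each column, then the $q$-norm of the resulting vector.

```python
import numpy as np

def lpq_norm(X, p, q):
    """l_{p,q} norm: ||X||_{p,q} = (sum_i ||X[:, i]||_p^q)^(1/q),
    i.e. the q-norm of the vector of column-wise p-norms."""
    col_norms = np.linalg.norm(X, ord=p, axis=0)  # p-norm of each column
    return np.linalg.norm(col_norms, ord=q)

X = np.array([[3.0, 0.0],
              [4.0, 2.0]])
# Column 2-norms are (5, 2); their 1-norm is 7.
val = lpq_norm(X, p=2, q=1)
```

As a sanity check, $\|X\|_{2,2}$ coincides with the Frobenius norm.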

Table 1: Table of Notations.

$\mathcal{J}_{z}(\cdot,\cdot)$, $\pi_{z}^{*}(\cdot)$ : value function and optimal policy $\pi_{z}^{*}(\cdot) := \operatorname{argmax}_{\pi}\mathcal{J}(\pi,\cdot)$ concerning the ground-truth $\mathbb{O}$

$\widehat{\mathcal{J}}_{z}(\cdot,\cdot)$, $\widehat{\pi}_{z}^{*}(\cdot)$ : value function and optimal policy $\widehat{\pi}_{z}^{*}(\cdot) := \operatorname{argmax}_{\pi}\widehat{\mathcal{J}}(\pi,\cdot)$ concerning the pretrained $\mathbb{O}_{\widehat{\gamma}}$

$\mathbb{P}_{\mathcal{D}}(\cdot)$, $\mathbb{P}_{\mathcal{C}}(\cdot)$ : probability induced by the distribution of joint and contrastive data collection

$\pi^{t}_{h,\mathtt{LLM}}$, $\widehat{\pi}^{t}_{h,\mathtt{LLM}}$ : $\pi^{t}_{h,\mathtt{LLM}}(\cdot \mid \tau_{h}^{t},\omega^{t}) := \mathtt{LLM}(\cdot \mid \mathtt{pt}_{h}^{t})$ and $\widehat{\pi}^{t}_{h,\mathtt{LLM}}(\cdot \mid \tau_{h}^{t},\omega^{t}) := \mathtt{LLM}_{\widehat{\theta}}(\cdot \mid \mathtt{pt}_{h}^{t})$ at step $h$

$\mathbb{P}_{z}(\cdot)$, $\widehat{\mathbb{P}}_{z}(\cdot)$ : probability under the environment featured by $z$, with ground-truth $\mathbb{O}$ or pretrained $\mathbb{O}_{\widehat{\gamma}}$

$\mathbb{P}_{z}^{\pi}(\cdot)$, $\widehat{\mathbb{P}}_{z}^{\pi}(\cdot)$ : probability under the environment featured by $z$ and policy $\pi$, with ground-truth $\mathbb{O}$ or pretrained $\mathbb{O}_{\widehat{\gamma}}$

$\mathcal{P}_{\Omega}(\cdot)$, $\mathcal{P}_{\mathcal{Z}}(\cdot)$ : prior distributions of high-level tasks and latent variables

$\breve{\tau}_{h/t}^{i}$ : $\breve{\tau}_{h/t}^{i} = \tau_{H}$ for all $i < t$ and $\breve{\tau}_{h/t}^{t} = \tau_{h}$

$\mathbb{P}_{z}(\cdot \mid \cdot, \mathbf{do}\,\cdot)$ : $\mathbb{P}_{z}(\cdot \mid o_{1}, \mathbf{do}\, g_{1:h-1}) := \int_{o_{2:h-1}} \prod_{h'=1}^{h-1} \mathbb{P}_{z}\big(o_{h'+1} \mid (o,g)_{1:h'}\big)\,\mathrm{d}o_{2:h-1}$

$\mathbb{P}_{\mathtt{LLM}}^{t}(\cdot \mid \cdot, \mathbf{do}\,\cdot)$ : $\mathbb{P}_{\mathtt{LLM}}^{t}(\cdot \mid o_{1}, \mathbf{do}\, g_{1:h-1}) := \int_{o_{2:h-1}} \prod_{h'=1}^{h-1} \mathbb{P}_{\mathcal{D}}\big(o_{h'+1} \mid (o,g)_{1:h'}, \mathcal{H}_{t}\big)\,\mathrm{d}o_{2:h-1}$

$\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\cdot,\cdot)$, $\widehat{\pi}_{\mathtt{LLM}}^{t,*}(\cdot)$ : value function of the environment simulated by $\mathtt{LLM}_{\widehat{\theta}}$ and $\widehat{\pi}_{\mathtt{LLM}}^{t,*}(\cdot) := \operatorname{argmax}_{\pi}\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\pi,\cdot)$

$\mathcal{J}_{t,\mathtt{LLM}}(\cdot,\cdot)$, $\pi_{\mathtt{LLM}}^{t,*}(\cdot)$ : value function of the environment simulated by the perfect $\mathtt{LLM}$ and $\pi_{\mathtt{LLM}}^{t,*}(\cdot) := \operatorname{argmax}_{\pi}\mathcal{J}_{t,\mathtt{LLM}}(\pi,\cdot)$

$\mathbb{P}_{\mathtt{LLM}}^{t}(\cdot)$, $\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}(\cdot)$ : probability of the environment simulated by the perfect $\mathtt{LLM}$ or the pretrained $\mathtt{LLM}_{\widehat{\theta}}$ with $\mathcal{H}_{t}$

$D_{\rm TV}(P,Q)$ : total variation distance, $D_{\rm TV}(P,Q) := 1/2\cdot\mathbb{E}_{x\sim P}\big[|\mathrm{d}Q(x)/\mathrm{d}P(x) - 1|\big]$

$D_{\rm H}^{2}(P,Q)$ : squared Hellinger distance, $D_{\rm H}^{2}(P,Q) := 1/2\cdot\mathbb{E}_{x\sim P}\big[\big(\sqrt{\mathrm{d}Q(x)/\mathrm{d}P(x)} - 1\big)^{2}\big]$

$D_{\rm KL}(P\,\|\,Q)$ : KL divergence, $D_{\rm KL}(P\,\|\,Q) := \mathbb{E}_{x\sim P}\big[\log \mathrm{d}P(x)/\mathrm{d}Q(x)\big]$

$\chi^{2}(P\,\|\,Q)$ : $\chi^{2}$-divergence, $\chi^{2}(P\,\|\,Q) := \mathbb{E}_{x\sim P}\big[\big(\mathrm{d}Q(x)/\mathrm{d}P(x) - 1\big)^{2}\big]$

$\widehat{\mathbb{E}}_{\mathcal{D}}[f]$ : $\widehat{\mathbb{E}}_{\mathcal{D}}[f] := 1/n\cdot\sum_{t=1}^{n} f(\ell_{t})$ given dataset $\mathcal{D} = \{\ell_{t}\}_{t\in[n]}$

$\bar{\mathbb{P}}_{\mathcal{D}}(\cdot)$, $\bar{\mathbb{E}}_{\mathcal{D}}[f]$ : $\bar{\mathbb{P}}_{\mathcal{D}}(\cdot) := \sum_{n=1}^{N}\sum_{t=0}^{T-1}\mathbb{P}_{\mathcal{D}}(\cdot \mid \ell_{1:t}^{n})/NT$ and $\bar{\mathbb{E}}_{\mathcal{D}}[f] := \mathbb{E}_{\ell\sim\bar{\mathbb{P}}_{\mathcal{D}}}[f(\ell)]$ given $\mathcal{D} = \{\ell_{1:T}^{n}\}_{n\in[N]}$
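For discrete distributions, the divergences in Table 1 reduce to simple sums. The following numpy sketch (our illustration) implements them with the table's conventions, including the $\chi^{2}$-divergence taken under $P$; the test below checks the standard orderings (Pinsker's inequality and $D_{\rm H}^{2}\leq D_{\rm TV}$).

```python
import numpy as np

def tv(p, q):
    """Total variation: (1/2) * sum_x |p(x) - q(x)|."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def hellinger_sq(p, q):
    """Squared Hellinger distance: (1/2) * sum_x (sqrt(p) - sqrt(q))^2."""
    return 0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()

def kl(p, q):
    """KL divergence D_KL(P || Q) (assumes q(x) > 0 wherever p(x) > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return (p[mask] * np.log(p[mask] / q[mask])).sum()

def chi_sq(p, q):
    """chi^2(P || Q) = E_{x ~ P}[(q(x)/p(x) - 1)^2], the table's convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return (p[mask] * (q[mask] / p[mask] - 1.0) ** 2).sum()

p = np.array([0.5, 0.5])
q = np.array([0.8, 0.2])
# tv(p, q) = 0.3 and chi_sq(p, q) = 0.36 for this pair.
```

These are the quantities appearing in Theorem 5.5 ($D_{\rm TV}$) and the $\chi^{2}$-bound that follows it.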

A.1 Hierarchical Markov Decision Process

In this subsection, we present a formalized definition of the HMDP model introduced in § 3.1 .

Low-level MDP.

Define $\mathcal{G}$ as the space of high-level actions. For fixed $g\in\mathcal{G}$ and high-level step $h\in[H]$, the low-level MDP is defined as $\mathcal{M}_{h}(g) = (\mathcal{S},\mathcal{A},H_{a},\mathbb{T}_{h},\bar{r}_{g})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the low-level action space, $H_{a}$ is the number of steps, $\mathbb{T}_{h} = \{\mathbb{T}_{h,\bar{h}}\}_{\bar{h}\in[H_{a}]}$ is the transition kernel, and $\bar{r}_{g} = \{\bar{r}_{\bar{h}}\}_{\bar{h}\in[H_{a}]}$ is the reward function with $\bar{r}_{\bar{h}}: \mathcal{S}\times\mathcal{A}\times\mathcal{G}\mapsto\mathbb{R}$. The low-level agent follows policy $\mu = \{\mu_{g}\}_{g\in\mathcal{G}}$, where $\mu_{g} = \{\mu_{\bar{h}}\}_{\bar{h}\in[H_{a}]}$ and $\mu_{\bar{h}}: \mathcal{S}\times\mathcal{G}\mapsto\Delta(\mathcal{A})$.

High-level POMDP.

Let $\Omega$ be the space of disclosed variables, and write $z = (\mathbb{T},\mu)$ to feature the low-level environment. Each low-level episode corresponds to a single high-level action. Given a fixed pair $(z,\omega)\in\mathcal{Z}\times\Omega$, the POMDP is characterized by $\mathcal{W}(z,\omega) = (\mathcal{S},\mathcal{O},\mathcal{G},H,\mathbb{P}_{z},r_{\omega})$, where $\mathcal{O}$ is the observation space, $\mathbb{O} = \{\mathbb{O}_{h}\}_{h\in[H]}$ is the emission distribution with $\mathbb{O}_{h}: \mathcal{S}\mapsto\Delta(\mathcal{O})$, $r_{\omega} = \{r_{h}\}_{h\in[H]}$ is the reward function with $r_{h}: \mathcal{O}\times\Omega\mapsto\mathbb{R}$, and $\mathbb{P}_{z} = \{\mathbb{P}_{z,h}\}_{h\in[H]}$ is the high-level transition kernel given by

\[
\mathbb{P}_{z,h}(s' \mid s,g) = \mathbb{P}\big(\bar{s}_{h,H_{a}+1} = s' \,\big|\, \bar{s}_{h,1} = s,\ a_{h,1:\bar{h}} \sim \mu_{g},\ \bar{s}_{h,2:\bar{h}+1} \sim \mathbb{T}_{h}\big),
\]

for all $h\in[H]$. The state space $\mathcal{S}$ and the latent variable $z$ are inherited from the low-level MDP.
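The high-level transition kernel above is induced by rolling out the low-level MDP for $H_{a}$ steps under the subgoal-conditioned policy $\mu_{g}$. The following tabular sketch (a toy instance of ours, not from the paper) samples $s' \sim \mathbb{P}_{z,h}(\cdot \mid s, g)$ this way.

```python
import numpy as np

def high_level_step(s, g, low_policy, low_trans, H_a, rng):
    """Sample s' ~ P_{z,h}(. | s, g) by executing the low-level policy
    mu_g for H_a steps starting from state s.
    low_policy[g][s] : distribution over low-level actions a
    low_trans[a][s]  : distribution over next states s'"""
    for _ in range(H_a):
        a = rng.choice(len(low_policy[g][s]), p=low_policy[g][s])
        s = rng.choice(len(low_trans[a][s]), p=low_trans[a][s])
    return s

# Toy instance: 2 states, 2 actions, 1 subgoal; action 0 keeps the state,
# action 1 flips it, and subgoal g = 0 always selects action 1.
rng = np.random.default_rng(0)
low_policy = {0: np.array([[0.0, 1.0], [0.0, 1.0]])}   # mu_g(a | s) for g = 0
low_trans = {0: np.eye(2), 1: np.array([[0.0, 1.0], [1.0, 0.0]])}
s_next = high_level_step(s=0, g=0, low_policy=low_policy,
                         low_trans=low_trans, H_a=2, rng=rng)   # flips twice
```

With $H_{a} = 2$ the state is flipped twice and returns to $0$, illustrating how the low-level dynamics determine the high-level kernel.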

Please refer to Figure 2 for the interactive protocol of the HMDP. Furthermore, for the high-level POMDP, the state value function of policy $\pi$ is defined as

\[
V_{z,h}^{\pi}(s,\tau,\omega) = \mathbb{E}_{\pi}\left[\sum_{h'=h}^{H} r_{h'}(o_{h'},\omega) \,\Big|\, s_{h} = s,\, \tau_{h} = \tau\right], \tag{A.1}
\]

where the trajectory $\tau_{h}\in(\mathcal{O}\times\mathcal{G})^{h-1}\times\mathcal{O}$, and similarly we define the state-action value function as

\[
Q_{z,h}^{\pi}(s,\tau,g,\omega) = \mathbb{E}_{\pi}\left[\sum_{h'=h}^{H} r_{h'}(o_{h'},\omega) \,\Big|\, s_{h} = s,\, \tau_{h} = \tau,\, g_{h} = g\right], \tag{A.2}
\]

where the expectation is taken with respect to the policy $\pi$, the transition kernel $\mathbb{P}_{z}$, and the emission distribution $\mathbb{O}$. Besides, for all $h\in[H]$, denote the probability of observing trajectory $\tau_{h}$ under policy $\pi$ as

\[
\mathbb{P}_{z}^{\pi}(\tau_{h}) = \pi(\tau_{h})\cdot\mathbb{P}_{z}(\tau_{h}), \quad \mathbb{P}_{z}(\tau_{h}) = \prod_{h'=1}^{h-1}\mathbb{P}\big(o_{h'+1} \mid \tau_{h'}, g_{h'}\big), \quad \pi(\tau_{h}) = \prod_{h'=1}^{h-1}\pi_{h'}\big(g_{h'} \mid \tau_{h'}\big), \tag{A.3}
\]

where $\mathbb{P}_{z}(\tau_{h})$ denotes the part of the probability of $\tau_{h}$ incurred by the dynamic environment, independent of policies, and $\pi(\tau_{h})$ denotes the part attributed to the randomness of the policy.

A.2 LLM Pretraining under Transformer Architecture

Transformer and Attention Mechanism.

Consider a sequence of $n$ input vectors $\{\mathbf{h}_{i}\}_{i=1}^{n}\subset\mathbb{R}^{d}$, written as an input matrix $\mathbf{H} = [\mathbf{h}_{1},\dots,\mathbf{h}_{n}]^{\top}\in\mathbb{R}^{n\times d}$, where each $\mathbf{h}_{i}$ is a row of $\mathbf{H}$ (also called a token). Given $\mathbf{K}\in\mathbb{R}^{n_{s}\times d}$ and $\mathbf{V}\in\mathbb{R}^{n_{s}\times d_{s}}$, the (softmax) attention mechanism maps the input vectors via the function $\mathtt{attn}(\mathbf{H},\mathbf{K},\mathbf{V}) = \mathtt{Softmax}(\mathbf{H}\mathbf{K}^{\top})\mathbf{V}\in\mathbb{R}^{n\times d_{s}}$, where the softmax function is applied row-wise and normalizes each vector via the exponential function such that $[\mathtt{Softmax}(\mathbf{h})]_{i} = \exp(\mathbf{h}_{i})/\sum_{j=1}^{d}\exp(\mathbf{h}_{j})$ for all $i\in[d]$. To approximate sophisticated functions, practitioners use Multi-Head Attention (MHA) instead, which forwards the input vectors into $h$ attention modules in parallel, with $h\in\mathbb{N}$ a hyperparameter, and outputs the sum of these sub-modules.
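The attention map and its multi-head sum can be sketched directly from these formulas. The numpy code below is a minimal illustration of ours (names and shapes are hypothetical), with per-head weights following the convention $\mathbf{W}^{H}_{i},\mathbf{W}^{K}_{i}\in\mathbb{R}^{d\times d_{h}}$ and $\mathbf{W}^{V}_{i}\in\mathbb{R}^{d\times d}$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn(H, K, V):
    """attn(H, K, V) = Softmax(H K^T) V, softmax applied row-wise."""
    return softmax(H @ K.T) @ V

def mha(H, weights):
    """Multi-head attention: sum of h parallel attention modules with
    per-head weight triples (W_H, W_K, W_V)."""
    return sum(attn(H @ W_H, H @ W_K, H @ W_V) for (W_H, W_K, W_V) in weights)

# n = 4 tokens of dimension d = 8, h = 2 heads with d_h = d // h.
rng = np.random.default_rng(0)
n, d, n_heads = 4, 8, 2
H = rng.standard_normal((n, d))
weights = [(rng.standard_normal((d, d // n_heads)),
            rng.standard_normal((d, d // n_heads)),
            rng.standard_normal((d, d))) for _ in range(n_heads)]
out = mha(H, weights)  # shape (n, d), ready for the FF sublayer
```

Each head outputs an $n\times d$ matrix here because $\mathbf{W}^{V}_{i}$ maps back to dimension $d$, so the heads can be summed as in the definition of $\mathtt{Mha}$.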
Denote by $\mathbf{W} = \{(\mathbf{W}^{H}_{i},\mathbf{W}^{K}_{i},\mathbf{W}^{V}_{i})\}_{i=1}^{h}$ the set of weight matrices; the MHA outputs $\mathtt{Mha}(\mathbf{H},\mathbf{W}) = \sum_{i=1}^{h}\mathtt{attn}(\mathbf{H}\mathbf{W}^{H}_{i},\mathbf{H}\mathbf{W}^{K}_{i},\mathbf{H}\mathbf{W}^{V}_{i})$, where $\mathbf{W}^{H}_{i}\in\mathbb{R}^{d\times d_{h}}$, $\mathbf{W}^{K}_{i}\in\mathbb{R}^{d\times d_{h}}$, and $\mathbf{W}^{V}_{i}\in\mathbb{R}^{d\times d}$ for all $i\in[h]$, and $d_{h}$ is usually set to $d/h$ (Michel et al., 2019). Based on the definitions above, we are ready to present the transformer architecture employed in LLMs such as BERT and GPT (Devlin et al., 2018; Brown et al., 2020). In detail, the transformer network has $D$ sub-modules, each consisting of an MHA and a Feed-Forward (FF) fully-connected layer. Given the input matrix $\mathbf{H}^{(0)} = \mathbf{H}\in\mathbb{R}^{n\times d}$, the $t$-th layer for $t\in[D]$ first takes the output of the $(t-1)$-th layer $\mathbf{H}^{(t-1)}$ as its input matrix, and forwards it to the MHA module with a projection function $\mathtt{Proj}[\cdot]$ and a residual link.
After receiving the intermediate $\overline{\mathbf{H}}^{(t)}\in\mathbb{R}^{n\times d}$, the FF module maps each row through the same single-hidden-layer neural network with $d_{F}$ neurons, i.e., $\mathtt{ReLU}(\overline{\mathbf{H}}^{(t)}\mathbf{A}_{1}^{(t)})\mathbf{A}_{2}^{(t)}$, where $\mathbf{A}_{1}^{(t)}\in\mathbb{R}^{d\times d_{F}}$, $\mathbf{A}_{2}^{(t)}\in\mathbb{R}^{d_{F}\times d}$, and $[\mathtt{ReLU}(\mathbf{X})]_{i,j} = \max\{\mathbf{X}_{i,j},0\}$. Specifically, the output of the $t$-th layer with $t\in[D]$ can be summarized as below:

\[
\overline{\mathbf{H}}^{(t)} = \mathtt{Proj}\left[\mathtt{Mha}\big(\mathbf{H}^{(t-1)},\mathbf{W}^{(t)}\big) + \gamma_{1}^{(t)}\mathbf{H}^{(t-1)}\right], \quad \mathbf{H}^{(t)} = \mathtt{Proj}\left[\mathtt{ReLU}\big(\overline{\mathbf{H}}^{(t)}\mathbf{A}_{1}^{(t)}\big)\mathbf{A}_{2}^{(t)} + \gamma_{2}^{(t)}\overline{\mathbf{H}}^{(t)}\right],
\]

where $\gamma_{1}^{(t)}$ and $\gamma_{2}^{(t)}$ control the allocation of the residual links. The final output of the transformer is the probability of the next token via a softmax distribution such that

\[
\mathbf{H}^{(D+1)} = \mathtt{Softmax}\left(\mathbf{1}^{\top}\mathbf{H}^{(D)}\mathbf{A}^{(D+1)}/N\gamma^{(D+1)}\right),
\]

where $\mathbf{A}^{(D+1)}\in\mathbb{R}^{d\times d_{E}}$ denotes the weight matrix with dimension $d_{E}\in\mathbb{N}$ and $\gamma^{(D+1)}\in(0,1]$ is the fixed temperature parameter. Let $\bm{\theta}^{(t)}=(\mathbf{W}^{(t)},\mathbf{A}^{(t)},\bm{\gamma}^{(t)})$ for all $t\in[D]$, where $\mathbf{A}^{(t)}=(\mathbf{A}^{(t)}_{1},\mathbf{A}^{(t)}_{2})$ and $\bm{\gamma}^{(t)}=(\gamma_{1}^{(t)},\gamma_{2}^{(t)})$, and denote $\bm{\theta}^{(D+1)}=(\mathbf{A}^{(D+1)},\gamma^{(D+1)})$. Hence, the parameter of the whole transformer architecture is the concatenation of the parameters in each layer, i.e., $\bm{\theta}=(\bm{\theta}^{(1)},\dots,\bm{\theta}^{(D+1)})$, and we consider a bounded parameter space, defined as

$$\bm{\Theta}:=\Big\{\bm{\theta}\ \Big|\ \|\mathbf{A}^{(t)}_{1}\|_{F}\leq B_{A,1},\ \|\mathbf{A}^{(t)}_{2}\|_{F}\leq B_{A,2},\ \|\mathbf{A}^{(D+1),\top}\|_{1,2}\leq B_{A},\ |\gamma_{1}^{(t)}|\leq 1,\ |\gamma_{2}^{(t)}|\leq 1,\ |\gamma^{(D+1)}|\leq 1,\ \|\mathbf{W}_{i}^{H,(t)}\|\leq B_{H},\ \|\mathbf{W}_{i}^{K,(t)}\|\leq B_{K},\ \|\mathbf{W}_{i}^{V,(t)}\|\leq B_{V},\ \forall(i,t)\in[h]\times[D]\Big\}.$$

To facilitate the statement of Theorem 2, we further define $\bar{D}=D^{2}d\cdot(d_{h}+d_{F}+d)+d_{E}\cdot d$ and $\bar{B}=\gamma^{-1}RhB_{A,1}B_{A,2}B_{A}B_{H}B_{K}B_{V}$, where $R$ is the (almost sure) upper bound on the magnitude of each token $\ell\in\mathfrak{L}$ in the token sequence $S_{t}\in\mathfrak{L}^{*}$, as defined in Assumption 5.1.
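As a concrete illustration of the layer recursion above, the following NumPy sketch composes the attention and feed-forward blocks with residual weights $\gamma_1^{(t)},\gamma_2^{(t)}$ and the softmax output head. The dimensions, the random weights, and the clipping radius used for $\mathtt{Proj}$ are illustrative assumptions, not part of the construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def row_softmax(S):
    E = np.exp(S - S.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def proj(H, R=10.0):
    # Proj: clip each token (row) back into the Euclidean ball of radius R
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    return H * np.minimum(1.0, R / np.maximum(norms, 1e-12))

def mha(H, heads):
    # Mha(H, W^{(t)}): sum of per-head softmax-attention outputs
    return sum(row_softmax((H @ WQ) @ (H @ WK).T) @ (H @ WV)
               for WQ, WK, WV in heads)

def layer(H, heads, A1, A2, g1, g2):
    # one layer: attention block with residual, then ReLU feed-forward with residual
    H_bar = proj(mha(H, heads) + g1 * H)
    return proj(np.maximum(H_bar @ A1, 0.0) @ A2 + g2 * H_bar)

N, d, d_F, d_E, n_heads = 4, 8, 16, 10, 2
heads = [tuple(0.1 * rng.normal(size=(d, d)) for _ in range(3)) for _ in range(n_heads)]
H = rng.normal(size=(N, d))
H = layer(H, heads, 0.1 * rng.normal(size=(d, d_F)), 0.1 * rng.normal(size=(d_F, d)), 1.0, 1.0)

# output head: softmax(1^T H A^{(D+1)} / (N * gamma)) gives the next-token distribution
A_out, gamma = rng.normal(size=(d, d_E)), 1.0
p = row_softmax(np.ones((1, N)) @ H @ A_out / (N * gamma))
```

The output `p` is a probability vector over the $d_E$-dimensional token alphabet, matching the shape of $\mathbf{H}^{(D+1)}$.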

Markov Chains.

We follow the notations used in Paulin (2015) and Zhang et al. (2023). Let $\Omega$ be a Polish space. The transition kernel of a time-homogeneous Markov chain $\{X_{i}\}_{i=1}^{\infty}$ supported on $\Omega$ is a probability distribution $\mathbb{P}(x,y)$ for every $x\in\Omega$. Given $X_{1}=x_{1},\dots,X_{t-1}=x_{t-1}$, the conditional distribution of $X_{t}$ equals $\mathbb{P}(x_{t-1},y)$. A distribution $\pi$ is said to be a stationary distribution of this Markov chain if $\int_{x\in\Omega}\mathbb{P}(x,y)\cdot\pi(x)=\pi(y)$. We adopt $\mathbb{P}_{t}(x,\cdot)$ to denote the distribution of $X_{t}$ conditioned on $X_{1}=x$. The mixing time of the chain is defined by

$$d(t)=\sup_{x\in\Omega}D_{\rm TV}\big(\mathbb{P}_{t}(x,\cdot),\pi\big),\quad t_{\rm mix}(\varepsilon)=\min\{t\,|\,d(t)\leq\varepsilon\},\quad t_{\rm mix}=t_{\rm mix}(1/4).\tag{A.4}$$
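As a sanity check on these definitions, the following sketch computes $d(t)$ and $t_{\rm mix}$ exactly for a toy two-state chain; the kernel is an illustrative choice, not one from the paper.

```python
import numpy as np

# Two-state chain with kernel P; its stationary distribution pi satisfies pi P = pi.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])

def d_of_t(P, pi, t):
    # d(t) = sup_x TV(P_t(x, .), pi), where P_t is the t-step transition kernel
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(P.shape[0]))

def t_mix(P, pi, eps=0.25):
    # t_mix(eps) = min { t : d(t) <= eps }; eps = 1/4 recovers t_mix in (A.4)
    t = 1
    while d_of_t(P, pi, t) > eps:
        t += 1
    return t
```

For this chain, `t_mix(P, pi)` evaluates to 3, since $d(1)\approx 0.467$, $d(2)\approx 0.327$, and $d(3)\approx 0.229\leq 1/4$.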

Appendix B Extensions

B.1 LLM Planning via Bayesian Aggregated World Model

Algorithm 2 Planning with PAR System: Planner with LLM as World Model
1: Input: policy $\pi_{\mathtt{exp}}$ with $\eta\in(0,1)$, parameters $c_{\mathcal{Z}}>0$, $N_{\rm s}\in\mathbb{N}$, and $|\mathcal{Z}|\in\mathbb{N}$,
2: and reward function $r=\{r_{h}\}_{h\in[H]}$ specified by the human user.
3: Initialize $\mathcal{H}_{0}\leftarrow\{\}$, $\mathcal{D}_{t}^{\rm s}\leftarrow\{\}$ for all $t\in[T]$, and $\epsilon\leftarrow(\log(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T})/T\eta)^{1/2}$.
4: for episode $t$ from $1$ to $T$ do
5: Receive the high-level task $\omega^{t}$ from the human user.
6: Sample $\mathcal{I}_{t}\sim\text{Bernoulli}(\epsilon)$.
7: for simulation $n$ from $1$ to $N_{\rm s}$ do
8: Sample $g_{h,n}^{t,{\rm s}}\sim{\rm Unif}(\mathcal{G})$ for all $h\in[H]$ and set $\mathtt{pt}_{1,n}^{t}\leftarrow\mathcal{H}_{t}\cup\{o_{1}^{t},g_{1,n}^{t,{\rm s}}\}$.
9: for step $h$ from $1$ to $H$ do
10: Update $\mathtt{pt}_{h,n}^{t}\leftarrow\mathcal{H}_{t}\cup\big\{o_{1,n}^{t},g_{1,n}^{t,{\rm s}},\dots,o_{h,n}^{t,{\rm s}},g_{h,n}^{t,{\rm s}}\big\}$.
11: Predict $o_{h+1,n}^{t,{\rm s}}\sim\mathtt{LLM}(\cdot\,|\,\mathtt{pt}_{h,n}^{t})$ via prompting the LLM.
12: end for
13: Update $\mathcal{D}_{t}^{\rm s}\leftarrow\mathcal{D}_{t}^{\rm s}\cup\big\{o_{1,n}^{t},g_{1,n}^{t,{\rm s}},\dots,o_{H-1,n}^{t,{\rm s}},g_{H-1,n}^{t,{\rm s}},o_{H,n}^{t,{\rm s}}\big\}$.
14: end for
15: for step $h$ from $1$ to $H$ do
16: Collect the observation $o_{h}^{t}$ from the Reporter.
17: Compute $\pi_{\texttt{LLM}}^{t}\leftarrow\textsc{Optimal-planning}(\omega^{t},\mathcal{D}_{t}^{\rm s},r)$.
18: Sample $g_{h}^{t}\sim(1-\mathcal{I}_{t})\cdot\pi_{h,\texttt{LLM}}^{t}(\cdot\,|\,\omega^{t},\tau_{h}^{t})+\mathcal{I}_{t}\cdot\pi_{h,\mathtt{exp}}^{t}(\cdot\,|\,\tau_{h}^{t})$.
19: Send the subgoal $g_{h}^{t}$ to the Actor.
20: end for
21: Update $\mathcal{H}_{t+1}\leftarrow\mathcal{H}_{t}\cup\left\{\omega^{t},\tau_{H}^{t}\right\}$.
22: end for
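The exploration schedule (line 3) and the $\epsilon$-greedy mixture (line 18) of Algorithm 2 can be sketched as follows; the subgoal distributions and constants are toy placeholders rather than quantities from the analysis.

```python
import numpy as np

def exploration_prob(T, c_Z, card_Z, eta):
    # eps = (log(c_Z * |Z| * sqrt(T)) / (T * eta))^{1/2}, as in line 3
    return (np.log(c_Z * card_Z * np.sqrt(T)) / (T * eta)) ** 0.5

def select_subgoal(pi_llm, pi_exp, eps, rng):
    # line 18: with probability eps follow the exploration policy,
    # otherwise follow the LLM-planned policy
    I_t = rng.random() < eps          # I_t ~ Bernoulli(eps)
    p = pi_exp if I_t else pi_llm
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
eps = exploration_prob(T=10_000, c_Z=1.0, card_Z=8, eta=0.5)
g = select_subgoal(np.array([0.7, 0.2, 0.1]), np.array([1/3, 1/3, 1/3]), eps, rng)
```

Note that the exploration probability shrinks as $T$ grows, which is what yields the sublinear regret in the analysis.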

Recall that the pretraining algorithm in § 3.2 also equips the LLM with the capability to predict observations, i.e., to model $\mathbb{P}_{h}(o_{h}\,|\,(o,g)_{1:h-1})$. Existing literature has shown the benefits of augmenting the reasoning process with predicted world states, as doing so endows LLMs with more grounded inference without reliance on expert knowledge (Hu and Shu, 2023). Specifically, the Planner interactively prompts the LLM to internally simulate entire trajectories grounded in historical feedback. By leveraging model-based RL methods such as Monte Carlo Tree Search (Browne et al., 2012) and Real-Time Dynamic Programming (Barto et al., 1995), the Planner uses the LLM-simulated environment to optimize its strategies. The planning protocol is as follows: at the beginning of the $t$-th episode, the Planner iteratively prompts the LLM with the initial observation $o_{1}$, history $\mathcal{H}_{t}$, and subgoals $g_{1:H}$ sequentially to predict the observations $o_{1:H}$. Subsequently, a simulation dataset $\mathcal{D}_{t}^{\rm s}$ is collected, allowing the Planner to compute the optimal policy with rewards specified by the human user, using methods such as MCTS. We first show that the LLM-simulated environment conforms to a Bayesian Aggregated World Model (BAWM), which is formalized as follows.

Proposition B.1 (LLM as BAWM).

Assume that the distribution of the pretraining data is given by (3.5). Under the perfect setting in Definition 4.1, for each $(h,t)\in[H]\times[T]$, the LLM serves as a Bayesian aggregated world model in the sense that

$$\mathbb{P}_{\mathtt{LLM}}^{t}(\cdot\,|\,o_{1},\mathbf{do}\ g_{1:h-1})=\sum_{z\in\mathcal{Z}}\mathbb{P}_{z}\left(\cdot\,|\,o_{1},\mathbf{do}\ g_{1:h-1}\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,\mathcal{H}_{t}\right),\tag{B.1}$$

with marginal distributions defined as $\mathbb{P}_{z}(\cdot\,|\,o_{1},\mathbf{do}\ g_{1:h-1})=\int_{o_{2:h-1}}\prod_{h'=1}^{h-1}\mathbb{P}_{z}\left(o_{h'+1}\,|\,(o,g)_{1:h'}\right){\mathrm{d}}o_{2:h-1}$ and $\mathbb{P}_{\mathtt{LLM}}^{t}(\cdot\,|\,o_{1},\mathbf{do}\ g_{1:h-1})=\int_{o_{2:h-1}}\prod_{h'=1}^{h-1}\mathbb{P}_{\mathcal{D}}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right){\mathrm{d}}o_{2:h-1}$.

Proof of Proposition B.1.

Please refer to § E.1 for a detailed proof. ∎
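A minimal numerical sketch of the aggregation in Proposition B.1, with two latent environments and a discretized observation space; all distributions below are toy placeholders introduced only for illustration.

```python
import numpy as np

# Two latent environments z in Z; each induces a distribution over the next
# observation (discretized to 3 outcomes) under the intervened subgoals do(g_{1:h-1}).
P_z = {0: np.array([0.7, 0.2, 0.1]),
       1: np.array([0.1, 0.3, 0.6])}
prior = {0: 0.5, 1: 0.5}

def posterior(loglik):
    # P_D(z | H_t) is proportional to prior(z) * likelihood of the history under z
    w = {z: prior[z] * np.exp(loglik[z]) for z in prior}
    s = sum(w.values())
    return {z: wz / s for z, wz in w.items()}

def bawm_predict(loglik):
    # (B.1): posterior-weighted mixture of the per-environment predictions
    post = posterior(loglik)
    return sum(post[z] * P_z[z] for z in P_z)

post = posterior({0: -0.1, 1: -2.3})   # history strongly favors z = 0
pred = bawm_predict({0: -0.1, 1: -2.3})
```

Here `pred` concentrates near the prediction of the environment favored by the history, illustrating how the aggregated world model sharpens as context accumulates.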

Note that the generation distribution $\mathbb{P}_{\mathtt{LLM}}^{t}(\cdot\,|\,(o,g)_{1:h})=\texttt{LLM}(\cdot\,|\,(o,g)_{1:h},\mathcal{H}_{t})$ is non-stationary, since $\mathbb{P}_{\mathcal{D}}(z\,|\,(o,g)_{1:h},\mathcal{H}_{t})$ fluctuates with the simulated part $(o,g)_{1:h}$ due to the autoregressive nature of LLMs. In contrast, Proposition B.1 shows that the marginal distribution admits a stationary expression based on posterior aggregation. Akin to Assumption 5.6, we introduce the following coverage assumption.

Assumption B.2 (Strong Coverage).

There exist absolute constants $\lambda_{S,1},\lambda_{S,2}$ and $\lambda_{R}$ such that for all $z\in\mathcal{Z}$, length $t<\bar{T}_{\rm p}$ and policy sequence $\{\pi^{i}\}_{i\leq\lfloor t/2H\rfloor}$ from the Planner, it holds that (i) $\prod_{i=1}^{\lfloor t/2H\rfloor}\widehat{\mathbb{P}}_{z}^{\pi_{i}}(\tilde{S}_{i})\leq\lambda_{S,1}\cdot\bar{\mathbb{P}}_{\mathcal{D}_{\mathtt{LLM}}}\big((\tilde{S}_{i})_{i\leq\lfloor t/2H\rfloor}\big)$ and $\bar{\mathbb{P}}_{\mathcal{D}_{\mathtt{LLM}}}\big(\tilde{S}_{\lceil t/2H\rceil}\,|\,(\tilde{S}_{i})_{i\leq\lfloor t/2H\rfloor}\big)\geq\lambda_{S,2}$ for all ordered $S_{t}=(\tilde{S}_{i})_{i\leq\lceil t/2H\rceil}\in\mathfrak{L}^{*}$, where $|\tilde{S}_{i}|=2H$ for all $i<\lceil t/2H\rceil$, and (ii) $\bar{\mathbb{P}}_{\mathcal{D}_{\mathtt{Rep}}}(s)\geq\lambda_{R}$ for all $s\in\mathcal{S}$.

We remark that Assumption B.2 imposes a stronger coverage condition, particularly on the in-episode trajectory $\tilde{S}_{\lceil t/2H\rceil}$. Here, $\lceil t/2H\rceil$ denotes the number of episodes described in $S_{t}$. The stronger assumption is needed because the LLM now serves as a world model, which requires more extensive information across all kinds of scenarios. Suppose that the Planner can learn the optimal policy $\widehat{\pi}^{t,*}_{\mathtt{LLM}}={\rm argmax}_{\pi\in\Pi}\ \widehat{\mathcal{J}}_{\mathtt{LLM}}^{t}(\pi,\omega)$ with a sufficiently large number of simulation steps $|\mathcal{D}_{t}^{\rm s}|$, where $\widehat{\mathcal{J}}_{\mathtt{LLM}}^{t}$ denotes the value function under $\mathtt{LLM}_{\widehat{\theta}}$ and history $\mathcal{H}_{t}$. Akin to Algorithm 1, the planning algorithm that takes the LLM as a world model includes $\epsilon$-greedy exploration with an $\eta$-distinguishable $\pi_{\mathtt{exp}}$. The pseudocode is given in Algorithm 2. The following corollary presents the performance under the practical setting.

Corollary B.3 (Regret under Practical Setting with LLM as World Model).

Suppose that Assumptions 4.5, 5.1, 5.2, 5.4 and 5.6 hold. Given an $\eta$-distinguishable exploration policy $\pi_{\mathtt{exp}}$ and $T\leq T_{\rm p}$, under the practical setting, the Planner's algorithm in Algorithm 2 ensures that

$${\rm Reg}_{z}(T)\leq\tilde{\mathcal{O}}\Big(H\sqrt{T/\eta\cdot\log(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T})}+H^{2}T\cdot\Delta_{\rm p,wm}(N_{\rm p},T_{\rm p},H,1/\sqrt{T},\xi)\Big),$$

for any $z\in\mathcal{Z}$ and $\{\omega^{t}\}_{t\in[T]}$. The cumulative pretraining error of the PAR system satisfies

$$\Delta_{\rm p,wm}(N_{\rm p},T_{\rm p},H,\delta,\xi)=2(\eta\lambda_{R})^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}+2\lambda_{R}^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)+2\lambda_{S,1}\lambda_{S,2}^{-1}\cdot\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta),$$

where $\xi=(\eta,\lambda_{S,1},\lambda_{S,2},\lambda_{R})$ is defined in Definition 4.4 and Assumption 5.6, and the errors $\Delta_{\mathtt{LLM}}$ and $\Delta_{\mathtt{Rep}}$ are defined in Theorem 2 and Theorem 5.5. Under the practical setting, the Planner should explore with probability $\epsilon=(\log(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T})/T\eta)^{1/2}+H(\eta\lambda_{\min})^{-1}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,1/\sqrt{T})^{2}$.

Proof of Corollary B.3.

Please refer to § E.2 for a detailed proof. ∎

B.2 LLM-Empowered Multi-Agent Collaboration

Algorithm 3 Multi-Agent Planning with PAR System: Planner
1: Input: policy $\pi_{\mathtt{exp}}$ with $\eta\in(0,1)$, parameters $c_{\mathcal{Z}}>0$ and $|\mathcal{Z}|\in\mathbb{N}$.
2: Initialize $\mathcal{H}_{0}\leftarrow\emptyset$ and $\epsilon\leftarrow(HK\log(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T})/T\eta)^{1/2}$.
3: for episode $t$ from $1$ to $T$ do
4: Receive the high-level task $\omega^{t}$ from the human user.
5: Sample $\mathcal{I}_{t}\sim\text{Bernoulli}(\epsilon)$.
6: for step $h$ from $1$ to $H$ do
7: Collect the observation $o_{h}^{t}$ from the Reporter.
8: for Actor $k$ from $1$ to $K$ do
9: Set $\mathtt{pt}_{h,k}^{t}\leftarrow\mathcal{H}_{t}\cup\left\{\omega^{t},o_{1}^{t},\mathbf{g}_{1}^{t},\dots,o_{h}^{t},k\right\}$.
10: Sample $g_{h,k,\mathtt{LLM}}^{t}\sim\mathtt{LLM}(\cdot\,|\,\mathtt{pt}_{h,k}^{t})$ via prompting the LLM.
11: end for
12: If $\mathcal{I}_{t}=1$ then sample $\mathbf{g}_{h}^{t}\sim\pi_{h,\mathtt{exp}}(\cdot\,|\,\tau_{h}^{t})$; else set $\mathbf{g}_{h}^{t}\leftarrow\mathbf{g}_{h,\mathtt{LLM}}^{t}$.
13: Send the subgoals $\mathbf{g}_{h}^{t}$ to the Actors.
14: end for
15: Update $\mathcal{H}_{t+1}\leftarrow\mathcal{H}_{t}\cup\left\{\omega^{t},\tau_{H}^{t}\right\}$.
16: end for

To characterize the multi-agent interactive process of task planning, i.e., with several Actors, we consider a turn-based cooperative hierarchical Markov Game (HMG), corresponding to the HMDP in § 3.1. The HMG consists of a low-level language-conditioned Markov Game (MG) and a high-level language-conditioned cooperative Partially Observable Markov Game (POMG). To extend the framework, we introduce the following modifications: (i) low-level MG: let $\mathcal{K}=[K]$ be the set of Actors, and let $\mathcal{G}=\mathcal{G}_{1}\times\dots\times\mathcal{G}_{K}$ and $\mathcal{A}=\mathcal{A}_{1}\times\dots\times\mathcal{A}_{K}$ be the spaces of subgoals and low-level actions. The low-level Actors plan following a joint policy $\mu=\{\mu_{h}\}_{h\in[H]}$ with $\mu_{h}:\mathcal{S}\times\mathcal{G}\mapsto\Delta(\mathcal{A})$, where the components $\{\mu_{h,k}\}_{k\in\mathcal{K}}$ can be correlated, e.g., as in a zero-sum game or a Stackelberg game (Başar and Olsder, 1998). (ii) high-level POMG: under cooperation, we assume that the policies factorize as

$$\pi_{h}(\mathbf{g}_{h}\,|\,\tau_{h-1},\omega)=\prod_{k=1}^{K}\pi_{h,k}(g_{h,k}\,|\,\tau_{h-1},\omega),\quad\forall h\in[H].$$

The remaining concepts are consistent with the HMDP. Here, the Planner assumes the role of a central controller and solves a fully cooperative POMG that aims to maximize a shared value function. Thus, the Planner should infer both the Actors' intentions, i.e., the joint policy $\mu$, and the environment, i.e., the transition kernel $\mathbb{T}$, from the historical context, and then assign a subgoal to each Actor.
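The factorized high-level policy above can be sketched as a product of per-Actor marginals; the uniform marginals and the sizes $K$, $G$ below are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
K, G = 3, 4  # K Actors, each with G candidate subgoals

def sample_joint(per_actor, rng):
    # product policy: each Actor's subgoal is drawn independently from its marginal
    return tuple(int(rng.choice(len(p), p=p)) for p in per_actor)

def joint_prob(per_actor, g):
    # pi_h(g | tau, omega) = prod_k pi_{h,k}(g_k | tau, omega)
    return float(np.prod([p[gk] for p, gk in zip(per_actor, g)]))

marginals = [np.full(G, 1.0 / G) for _ in range(K)]  # uniform marginals as a toy case
g = sample_joint(marginals, rng)
```

With uniform marginals, every joint subgoal has probability $(1/G)^K$, which the `joint_prob` helper reproduces.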

Specifically, the LLM's recommendations are obtained by invoking the ICL ability of LLMs with a history-dependent prompt akin to (3.2), sequentially for each Actor. For the $k$-th Actor, the Planner prompts the LLM with $\mathtt{pt}_{h,k}^{t}=\mathcal{H}_{t}\cup\{\omega^{t},\tau_{h}^{t},k\}$, where $\mathcal{H}_{t}=\bigcup_{i=1}^{t-1}\{\omega^{i},\tau_{H}^{i}\}$ and $\tau_{h}^{t}=\{o_{1}^{t},\mathbf{g}_{1}^{t},\dots,o_{h}^{t}\}$. Under the perfect setting (see Definition 4.1), the LLM's joint policy for recommendations follows

$$\pi_{h,\mathtt{LLM}}^{t}\big(\mathbf{g}_{h}^{t}\,|\,\tau_{h}^{t},\omega^{t}\big)=\prod_{k\in\mathcal{K}}\left(\sum_{z\in\mathcal{Z}}\pi^{*}_{z,h,k}\left(g_{h,k}^{t}\,|\,\tau_{h}^{t},\omega^{t}\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,\mathtt{pt}_{h}^{t}\right)\right),\tag{B.2}$$

which is akin to Proposition 4.2; the proof of this statement is provided in § E.3. The pseudocode is presented in Algorithm 3. We now give the performance guarantee for the multi-agent scenario with a perfect PAR system.

Corollary B.4 (Multi-agent Collaboration Regret under Perfect Setting).

Suppose that Assumptions 4.1 and 4.5 hold. Given an $\eta$-distinguishable exploration policy $\pi_{\mathtt{exp}}$ and $T\leq T_{\rm p}$, the Planner's algorithm in Algorithm 3 guarantees that

$${\rm Reg}_{z}(T)\leq\tilde{\mathcal{O}}\left(H^{\frac{3}{2}}\sqrt{TK/\eta\cdot\log\left(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T}\right)}\right),$$

for any $z\in\mathcal{Z}$ and $\{\omega^{t}\}_{t\in[T]}$, provided that the Planner explores with $\epsilon=(HK\log(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T})/T\eta)^{1/2}$.

Proof of Corollary B.4.

Please refer to § E.3 for a detailed proof. ∎

Corollary B.4 is akin to Theorem 4.6, with an additional $\sqrt{K}$ factor in the regret. Besides, the multi-agent latent variable space satisfies $|\mathcal{Z}|=|\mathcal{Z}_{\mathbb{T}}|\times|\mathcal{Z}_{\mu,\rm m}|$, where $\mathcal{Z}_{\mu,\rm m}$ is the space of joint policies, and it is generally larger than its single-agent counterpart. Specifically, if the Actors' policies are uncorrelated, then $\log|\mathcal{Z}_{\mu,\rm m}|=K\log|\mathcal{Z}_{\mu,\rm s}|$, resulting in a $\sqrt{K}$-times larger regret. The extension to the practical setting follows the derivations in Theorem 5.7 and is omitted.

Appendix C Proofs for Section 4: Perfect Setting

C.1 Proof of Proposition 4.2

Proof of Proposition 4.2. Note that for all $h\in[H]$ and $t\in[T]$, we have

$$\begin{aligned}\pi_{h,\mathtt{LLM}}^{t}\left(g_{h}^{t}\,|\,\tau_{h}^{t},\omega^{t}\right)&=\sum_{z\in\mathcal{Z}}\mathbb{P}_{\mathcal{D}}\left(g_{h}^{t}\,|\,\mathtt{pt}_{h}^{t},z\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,\mathtt{pt}_{h}^{t}\right)\\&=\sum_{z\in\mathcal{Z}}\mathbb{P}_{\mathcal{D}}\left(g_{h}^{t}\,|\,\mathcal{H}_{t},\tau_{h}^{t},\omega^{t},z\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,\mathtt{pt}_{h}^{t}\right)\\&=\sum_{z\in\mathcal{Z}}\pi^{*}_{z,h}\left(\cdot\,|\,\tau_{h}^{t},\omega^{t}\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,\mathtt{pt}_{h}^{t}\right),\end{aligned}\tag{C.1}$$

where the first equality results from the law of total probability, the second follows from the definition of prompts in (3.2), and the last results from the generation distribution. $\Box$

C.2 Proof of Theorem 4.6

Proof of Theorem 4.6. Recall that the Planner takes a mixture policy of $\pi_{\mathtt{exp}}$ and $\pi_{\mathtt{LLM}}$ such that

$$\pi_{h}^{t}(\cdot\,|\,\tau_{h}^{t},\omega^{t})=(1-\epsilon)\cdot\pi_{h,\mathtt{LLM}}^{t}(\cdot\,|\,\tau_{h}^{t},\omega^{t})+\epsilon\cdot\pi_{h,\mathtt{exp}}(\cdot\,|\,\tau_{h}^{t}),\tag{C.2}$$

and Proposition 4.2 indicates that the LLM's recommended policies take the form

$$\pi_{h,\mathtt{LLM}}^{t}\left(\cdot\,|\,\tau_{h}^{t},\omega^{t}\right)=\sum_{z\in\mathcal{Z}}\pi^{*}_{z,h}\left(\cdot\,|\,\tau_{h}^{t},\omega^{t}\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,\mathtt{pt}_{h}^{t}\right),\ \text{where }\mathtt{pt}_{h}^{t}=\mathcal{H}_{t}\cup\tau_{h}^{t},\ \mathcal{H}_{t}=\left\{\omega^{i},\tau_{H}^{i}\right\}_{i\in[t-1]},\tag{C.3}$$

for all $(h,t)\in[H]\times[T]$. Following (C.2), given $z\in\mathcal{Z}$ and $\{\omega^{t}\}_{t\in[T]}$, the regret decomposes as

$$\begin{aligned}{\rm Reg}(T)&=\underbrace{\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi_{h,\mathtt{exp}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\right]\cdot\epsilon}_{\textbf{(i)}}\\&\quad+\underbrace{\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\right]\cdot(1-\epsilon)}_{\textbf{(ii)}}\\&\leq\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\right]+HT\epsilon,\end{aligned}\tag{C.4}$$

where the equality results from the performance difference lemma (PDL, see Lemma F.4), we write $\pi_{h}Q_{h}(s_{h},\tau_{h},\omega)=\langle\pi_{h}(\cdot\,|\,\tau_{h},\omega),Q_{h}(s_{h},\tau_{h},\cdot,\omega)\rangle_{\mathcal{G}}$, and $\mathbb{P}_{z}^{\pi}(\tau_{h})$ is defined in (A.3). Based on Lemma C.1, with probability at least $1-\delta$, the following event $\mathcal{E}_{1}$ holds: for all $(h,t)\in[H]\times[T]$,

\[
\sum_{z'\in\mathcal{Z}}\sum_{i\in[t]}D_{\rm H}^{2}\big(\mathbb{P}_{z}^{\pi^{i}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z'}^{\pi^{i}}(\breve{\tau}_{h/t}^{i})\big)\cdot\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})\leq 2\log\left(c_{\mathcal{Z}}|\mathcal{Z}|/\delta\right),
\tag{C.5}
\]

where the randomness is incurred by $\mathtt{pt}_{h}^{t}$, and we define $\breve{\tau}_{h/t}^{i}=\tau_{H}$ for all $i\in[t-1]$ and $\breve{\tau}_{h/t}^{t}=\tau_{h}$ for notational simplicity. Suppose that event $\mathcal{E}_{1}$ in (C.5) holds, and denote by $\mathcal{X}^{t}_{\mathtt{exp}}=\{i\in[t]:\pi^{i}=\pi_{\mathtt{exp}}\}$ the set of exploration episodes. Note that for all $(h,t,z')\in[H]\times[T]\times\mathcal{Z}$, it holds that

\[
\sum_{i\in[t]}D_{\rm H}^{2}\big(\mathbb{P}_{z}^{\pi^{i}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z'}^{\pi^{i}}(\breve{\tau}_{h/t}^{i})\big)\geq\sum_{i\in\mathcal{X}^{t-1}_{\mathtt{exp}}}D_{\rm H}^{2}\left(\mathbb{P}_{z}^{\pi_{\mathtt{exp}}}(\tau_{H}),\mathbb{P}_{z'}^{\pi_{\mathtt{exp}}}(\tau_{H})\right)\geq\eta\cdot|\mathcal{X}^{t-1}_{\mathtt{exp}}|,
\tag{C.6}
\]

where the last inequality results from the fact that $\pi_{\mathtt{exp}}$ is $\eta$-distinguishable (see Definition 4.4) and that $D_{\rm H}^{2}(P,Q)\leq 1$ for all $P,Q\in\Delta(\mathcal{X})$. Combining (C.5) and (C.6), we obtain

\[
\sum_{z'\neq z}\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})\leq\min\left\{2\log\left(c_{\mathcal{Z}}|\mathcal{Z}|/\delta\right)\eta^{-1}/|\mathcal{X}^{t-1}_{\mathtt{exp}}|,\,1\right\},
\tag{C.7}
\]

for all $(h,t)\in[H]\times[T]$. Recall that (C.3) indicates that for all $(h,t)\in[H]\times[T]$, we have

\[
\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)(\cdot\,|\,\tau_{h},\omega)=\sum_{z'\neq z}\left(\pi_{z,h}^{*}-\pi_{z',h}^{*}\right)(\cdot\,|\,\tau_{h},\omega)\cdot\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t}).
\]
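This identity simply re-expresses the BAIL policy $\pi^{t}_{h,\mathtt{LLM}}=\sum_{z'}\pi^{*}_{z',h}\cdot\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})$ using the fact that the posterior sums to one, so the $z'=z$ term vanishes. A minimal numerical check with a hypothetical posterior and candidate optimal policies (all values illustrative):

```python
# BAIL aggregates the candidate environments' optimal policies under the
# posterior: pi_LLM(.) = sum_{z'} pi*_{z'}(.) * P(z' | prompt).
posterior = {"z1": 0.5, "z2": 0.3, "z3": 0.2}  # hypothetical posterior
policies = {  # hypothetical optimal policies over two subgoals
    "z1": [0.9, 0.1],
    "z2": [0.2, 0.8],
    "z3": [0.5, 0.5],
}
z_true = "z1"

pi_llm = [sum(posterior[z] * policies[z][g] for z in posterior) for g in range(2)]

# Left-hand side: pi*_z - pi_LLM; right-hand side: the posterior-weighted
# sum of (pi*_z - pi*_{z'}) over z' != z.  They coincide entrywise.
lhs = [policies[z_true][g] - pi_llm[g] for g in range(2)]
rhs = [
    sum(
        posterior[z] * (policies[z_true][g] - policies[z][g])
        for z in posterior
        if z != z_true
    )
    for g in range(2)
]
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```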

Based on Proposition 4.2 and conditioned on $\mathcal{E}_{1}$, it holds that

\[
\begin{aligned}
\sum_{t=1}^{T}\sum_{h=1}^{H}&\,\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\right]\\
&\leq H\cdot\sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{z'\neq z}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\mathbb{P}_{z}^{\pi^{t}}}\Big[\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})\Big]\\
&\leq 2\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)H\eta^{-1}\cdot\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}\left[\min\left\{1/|\mathcal{X}^{t-1}_{\mathtt{exp}}|,1\right\}\right].
\end{aligned}
\tag{C.8}
\]

Note that $\mathds{1}(\pi^{t}=\pi_{\mathtt{exp}})\overset{\rm iid}{\sim}{\rm Bernoulli}(\epsilon)$ for all $t\in[T]$. Besides, the following event $\mathcal{E}_{2}$ holds:

\[
\sum_{t=1}^{T}\min\left\{1/|\mathcal{X}^{t-1}_{\mathtt{exp}}|,1\right\}\leq\mathcal{O}\big(\epsilon^{-1}\log(T\log T/\delta)\big),
\tag{C.9}
\]

with probability at least $1-\delta$, by Lemma F.5. Combining (C.4), (C.8), and (C.9), we have

\[
\begin{aligned}
{\rm Reg}_{z}(T)&\leq\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\,\mathds{1}\left(\mathcal{E}_{1}\cap\mathcal{E}_{2}\text{ holds}\right)\right]\\
&\quad+\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\,\mathds{1}\left(\mathcal{E}_{1}\cap\mathcal{E}_{2}\text{ fails}\right)\right]+HT\epsilon\\
&\leq\mathcal{O}\Big(\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)H^{2}\log(T\log T/\delta)\cdot(\eta\epsilon)^{-1}+HT\epsilon+2HT\delta\Big)\\
&\leq\tilde{\mathcal{O}}\left(H^{\frac{3}{2}}\sqrt{\log\left(c_{\mathcal{Z}}|\mathcal{Z}|/\delta\right)T/\eta}\right),
\end{aligned}
\]

where we choose the exploration probability $\epsilon=\big(H\log\left(c_{\mathcal{Z}}|\mathcal{Z}|/\delta\right)/(T\eta)\big)^{1/2}$. Taking $\delta=1/\sqrt{T}$ in the arguments above, we conclude the proof of Theorem 4.6. $\Box$
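The chosen $\epsilon$ balances the exploitation term, which scales as $A/\epsilon$ with $A=\mathcal{O}(H^{2}\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\log(T\log T/\delta)/\eta)$, against the exploration cost $B\epsilon$ with $B=HT$. A quick numerical sketch (all constants are placeholder values, not the paper's actual quantities) confirms that $\epsilon^{*}=\sqrt{A/B}$ minimizes $A/\epsilon+B\epsilon$ with optimal value $2\sqrt{AB}$, which is $\sqrt{T}$-type:

```python
import math

# Bound shape: bound(eps) = A / eps + B * eps.
# Calculus gives the minimizer eps* = sqrt(A / B) and value 2 * sqrt(A * B).
H, T, eta, log_term = 5, 10_000, 0.1, 3.0  # placeholder constants
A = H ** 2 * log_term / eta
B = H * T

def bound(eps):
    return A / eps + B * eps

eps_star = math.sqrt(A / B)

# A grid search agrees with the closed-form minimizer.
grid = [i / 10_000 for i in range(1, 10_000)]
eps_grid = min(grid, key=bound)
assert abs(bound(eps_star) - bound(eps_grid)) / bound(eps_star) < 1e-3

# The optimal value 2 * sqrt(A * B) grows like sqrt(T): sublinear regret.
print(eps_star, bound(eps_star))
```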

C.3 Proof of Lemma C.1

Lemma C.1.

Suppose that Assumptions 4.1 and 4.5 hold. Given $\delta\in(0,1)$ and the ground-truth $z\in\mathcal{Z}$, for all $(h,t)\in[H]\times[T]$, with probability at least $1-\delta$, it holds that

\[
\sum_{z'\in\mathcal{Z}}\sum_{i\in[t]}D_{\rm H}^{2}\big(\mathbb{P}_{z}^{\pi^{i}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z'}^{\pi^{i}}(\breve{\tau}_{h/t}^{i})\big)\cdot\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})\leq 2\log\left(c_{\mathcal{Z}}|\mathcal{Z}|/\delta\right),
\]

where we denote $\breve{\tau}_{h/t}^{i}=\tau_{H}$ for all $i<t$ and $\breve{\tau}_{h/t}^{t}=\tau_{h}$, and $\mathbb{P}_{z}^{\pi}(\tau_{h})$ is defined in (A.3).

Proof of Lemma C.1. The proof is rather standard (e.g., see Geer, 2000). Let $\mathfrak{F}_{t}$ be the filtration induced by $\{\omega^{i},\tau_{H}^{i}\}_{i<t}\cup\{\mathds{1}(\pi^{i}=\pi_{\mathtt{exp}})\}_{i\in[t]}$. For all $(h,t,z')\in[H]\times[T]\times\mathcal{Z}$, with probability at least $1-\delta$, the information gain concerning $z'$ satisfies

\[
L_{h,t}(z')=\sum_{i=1}^{t}\log\left(\frac{\mathbb{P}_{z'}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}(\breve{\tau}_{h/t}^{i})}\right)\leq 2\log\mathbb{E}_{\mathfrak{F}_{1:t}}\left[\exp\left(\frac{1}{2}\sum_{i=1}^{t}\log\frac{\mathbb{P}_{z'}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}(\breve{\tau}_{h/t}^{i})}\right)\right]+2\log(|\mathcal{Z}|/\delta),
\tag{C.10}
\]

where the inequality follows from Lemma F.1 with $\lambda=1/2$ and a union bound taken over $\mathcal{Z}$. Besides,

\[
\mathbb{E}_{\mathfrak{F}_{1:t}}\left[\exp\left(\frac{1}{2}\sum_{i=1}^{t}\log\frac{\mathbb{P}_{z'}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}(\breve{\tau}_{h/t}^{i})}\right)\right]=\prod_{i=1}^{t}\left(1-D_{\rm H}^{2}\big(\mathbb{P}_{z}^{\pi^{i}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z'}^{\pi^{i}}(\breve{\tau}_{h/t}^{i})\big)\right).
\tag{C.11}
\]

Combining (C.10), (C.11), and the fact that $\log(1-x)\leq-x$ for all $x\leq 1$, it holds that

\[
L_{h,t}(z')\leq-2\sum_{i=1}^{t}D_{\rm H}^{2}\big(\mathbb{P}_{z}^{\pi^{i}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z'}^{\pi^{i}}(\breve{\tau}_{h/t}^{i})\big)+2\log(|\mathcal{Z}|/\delta),
\tag{C.12}
\]

with probability greater than $1-\delta$. Based on the Donsker-Varadhan representation in Lemma F.2 and the duality principle, we have $\log\mathbb{E}_{Q}[e^{f}]=\sup_{P\in\Delta(\mathcal{X})}\left\{\mathbb{E}_{P}[f]-D_{\rm KL}(P\,\|\,Q)\right\}$, where the supremum is attained at $P(x)\propto\exp(f(x))\cdot Q(x)$; see Lemma 4.10 in Van Handel (2014) for a detailed proof. Based on the arguments above, for all $(h,t,P)\in[H]\times[T]\times\Delta(\mathcal{Z})$, it holds that

\[
\sum_{z'\in\mathcal{Z}}L_{h,t}(z')\cdot P(z')-D_{\rm KL}\big(P\,\|\,\mathcal{P}_{\mathcal{Z}}\big)\leq\sum_{z'\in\mathcal{Z}}L_{h,t}(z')\cdot\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})-D_{\rm KL}\big(\mathbb{P}_{\mathcal{D}}(\cdot\,|\,\mathtt{pt}_{h}^{t})\,\|\,\mathcal{P}_{\mathcal{Z}}\big),
\tag{C.13}
\]

since $\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})\propto\exp\left(L_{h,t}(z')\right)\cdot\mathcal{P}_{\mathcal{Z}}(z')$ for all $(h,t)\in[H]\times[T]$. Let $\delta_{z}(\cdot)$ be the Dirac distribution over the singleton $\{z\}$. Then, taking $P=\delta_{z}$ in (C.13), we have

\[
\sum_{z'\in\mathcal{Z}}L_{h,t}(z')\cdot\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})\geq D_{\rm KL}\big(\mathbb{P}_{\mathcal{D}}(\cdot\,|\,\mathtt{pt}_{h}^{t})\,\|\,\mathcal{P}_{\mathcal{Z}}\big)+\log\mathcal{P}_{\mathcal{Z}}(z)\geq\log\mathcal{P}_{\mathcal{Z}}(z),
\tag{C.14}
\]

where the first inequality uses $D_{\rm KL}(\delta_{z}(\cdot)\,\|\,\mathcal{P}_{\mathcal{Z}}(\cdot))=-\log\mathcal{P}_{\mathcal{Z}}(z)$, which follows from the definitions. Therefore, for all $(h,t)\in[H]\times[T]$, with probability at least $1-\delta$, it holds that

\[
\begin{aligned}
\sum_{z'\in\mathcal{Z}}\sum_{i\in[t]}D_{\rm H}^{2}\big(\mathbb{P}_{z}^{\pi^{i}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z'}^{\pi^{i}}(\breve{\tau}_{h/t}^{i})\big)\cdot\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})&\leq-\sum_{z'\in\mathcal{Z}}L_{h,t}(z')/2\cdot\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathtt{pt}_{h}^{t})+\log\left(|\mathcal{Z}|/\delta\right)\\
&\leq 2\log\left(c_{\mathcal{Z}}|\mathcal{Z}|/\delta\right),
\end{aligned}
\tag{C.15}
\]

where the first inequality results from (C.12), and the last inequality follows from (C.14) and Assumption 4.5, which implies that $1/\mathcal{P}_{\mathcal{Z}}(z)\leq c_{\mathcal{Z}}|\mathcal{Z}|$. Thus, we conclude the proof of Lemma C.1. $\Box$
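The key identity behind (C.11) is that the exponential moment of half the log-likelihood ratio equals the Hellinger affinity, $\mathbb{E}_{x\sim P}[\sqrt{Q(x)/P(x)}]=\sum_{x}\sqrt{P(x)Q(x)}=1-D_{\rm H}^{2}(P,Q)$, which combined with $\log(1-x)\leq-x$ yields (C.12). A quick numerical check on two hypothetical discrete distributions (the specific probability values are illustrative):

```python
import math

# Two arbitrary distributions on a three-point space (illustrative values).
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.2, 0.6]

# Squared Hellinger distance: D_H^2(P, Q) = 1 - sum_x sqrt(P(x) * Q(x)).
hellinger_sq = 1.0 - sum(math.sqrt(p * q) for p, q in zip(P, Q))

# Exponential moment of half the log-likelihood ratio under P:
# E_{x~P}[exp(0.5 * log(Q(x)/P(x)))] = sum_x P(x) * sqrt(Q(x)/P(x)).
exp_moment = sum(p * math.sqrt(q / p) for p, q in zip(P, Q))

# The moment equals the Hellinger affinity 1 - D_H^2 ...
assert abs(exp_moment - (1.0 - hellinger_sq)) < 1e-12
# ... and log(1 - x) <= -x turns it into the Hellinger term in (C.12).
assert math.log(exp_moment) <= -hellinger_sq
```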

C.4 Proof of Proposition 4.3

Our construction of the hard-to-distinguish example is a natural extension of the hard instance for the contextual bandit problem in Proposition 1 of Zhang (2022).

Proof of Proposition 4.3.

Suppose that the high-level POMDP is fully observable, i.e., $\mathbb{O}(s)=s$, with $H=2$ and $|\Omega|=1$. Consider $\mathcal{S}=\{s_{1},s_{2},s_{3}\}$ with rewards $r(s_{1})=0.5$, $r(s_{2})=1$, $r(s_{3})=0$, subgoal set $\mathcal{G}=\{g_{1},g_{2}\}$, and $\mathcal{Z}=\{z_{1},\dots,z_{N}\}$. Starting from the initial state $s_{1}$, the transition kernel satisfies

\[
\left\{\begin{aligned}
&\mathbb{P}_{z_{i}}(s_{1}\,|\,s_{1},g_{1})=1,&&\mathbb{P}_{z_{i}}(s_{2}\,|\,s_{1},g_{1})=0,&&\mathbb{P}_{z_{i}}(s_{3}\,|\,s_{1},g_{1})=0,&&\forall i\in[N],\\
&\mathbb{P}_{z_{1}}(s_{1}\,|\,s_{1},g_{2})=0,&&\mathbb{P}_{z_{1}}(s_{2}\,|\,s_{1},g_{2})=1,&&\mathbb{P}_{z_{1}}(s_{3}\,|\,s_{1},g_{2})=0,&&\text{if }i=1,\\
&\mathbb{P}_{z_{i}}(s_{1}\,|\,s_{1},g_{2})=0,&&\mathbb{P}_{z_{i}}(s_{2}\,|\,s_{1},g_{2})=p_{i},&&\mathbb{P}_{z_{i}}(s_{3}\,|\,s_{1},g_{2})=1-p_{i},&&\text{if }i\neq 1,
\end{aligned}\right.
\]

where $p_{i}=0.5\,(1-\frac{i}{N})$ for all $i\in[N]$. For the latent environment $z_{1}$, the optimal policy satisfies $\pi^{*}_{z_{1},1}(s_{1})=g_{2}$, whereas $\pi^{*}_{z_{i},1}(s_{1})=g_{1}$ for $i\neq 1$. Suppose that the prior distribution $\mathcal{P}_{\mathcal{Z}}$ is uniform. At $t=1$, without any information, the posterior $\mathbb{P}(\cdot\,|\,\mathtt{pt}_{1})$ degenerates to the prior $\mathcal{P}_{\mathcal{Z}}(\cdot)={\rm Unif}_{\mathcal{Z}}(\cdot)$. Hence, the LLM's policy at the first step follows $\pi_{\mathtt{LLM}}(\cdot\,|\,s_{1})=(1-\frac{1}{N})\cdot\delta_{g_{1}}(\cdot)+\frac{1}{N}\cdot\delta_{g_{2}}(\cdot)$. Since $\mathbb{P}_{z_{i}}(s_{1}\,|\,s_{1},g_{1})=1$ and $\mathbb{P}_{z_{i}}(s_{2}\,|\,s_{1},g_{1})=\mathbb{P}_{z_{i}}(s_{3}\,|\,s_{1},g_{1})=0$ for all $i\in[N]$, taking subgoal $g_{1}$ provides no information to differentiate $z_{i}$ from the others, and the posterior remains uniform.
Such a situation, i.e., $\mathbb{P}(\cdot\,|\,\mathtt{pt}_{t})={\rm Unif}_{\mathcal{Z}}(\cdot)$, ends only if the LLM suggests taking $g_{2}$ at some episode $t$. Consider the hard trajectory $\tau_{\rm hard}=\{s_{1},g_{1},s_{1}\}_{t\in[T]}$, in which the LLM consistently adheres to the initial $\pi_{\mathtt{LLM}}$ and keeps recommending subgoal $g_{1}$. Thus, we have $\mathbb{P}_{z_{1}}(\tau_{\rm hard})=(1-1/N)^{T}$, indicating that ${\rm Reg}_{z_{1}}(T)\geq 0.5\,T\cdot(1-1/N)^{T}$. ∎
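To make the linear-regret claim concrete: with $N=T$, one has $(1-1/T)^{T}\geq 1/4$ for all $T\geq 2$, so the lower bound $0.5\,T\cdot(1-1/N)^{T}$ is at least $T/8$. A minimal numerical check (the parameter choices below are illustrative):

```python
# Regret lower bound from the hard instance: under z_1, the event that the
# greedy LLM policy recommends g_1 in every episode has probability
# (1 - 1/N)^T, and each such episode incurs instantaneous regret 0.5.
def regret_lower_bound(T, N):
    return 0.5 * T * (1.0 - 1.0 / N) ** T

# With N = T latent environments, (1 - 1/T)^T >= 1/4 for T >= 2, so the
# greedy strategy suffers regret at least T / 8: linear in T.
for T in [2, 10, 100, 1000]:
    assert regret_lower_bound(T, N=T) >= T / 8.0
```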

Appendix D Proof for Section 5: Practical Setting

D.1 Proof of Theorem 5.5

Proof of Theorem 5.5. Recall that the binary discriminator for label $y\in\{0,1\}$ is defined as

\[
\mathbb{D}_{\gamma}(y\,|\,o,s):=\left(\frac{f_{\gamma}(o,s)}{1+f_{\gamma}(o,s)}\right)^{y}\left(\frac{1}{1+f_{\gamma}(o,s)}\right)^{1-y},
\]

and the contrastive learning algorithm in (3.8) follows $\widehat{\gamma}={\rm argmax}_{\gamma\in\Gamma}\ \widehat{\mathbb{E}}_{\mathcal{D}_{\mathtt{Rep}}}\big[\log\mathbb{D}_{\gamma}(y\,|\,o,s)\big]$, so that $f_{\widehat{\gamma}}$ is the maximum likelihood estimator (MLE) over the dataset $\mathcal{D}_{\mathtt{Rep}}$. Based on Lemma F.3, the MLE-type algorithm ensures that, with probability at least $1-\delta$, it holds that

\[
\bar{\mathbb{E}}_{(o,s)\sim\mathcal{D}_{\mathtt{Rep}}}\left[D_{\rm TV}^{2}\left(\mathbb{D}_{\widehat{\gamma}}(\cdot\,|\,o,s),\mathbb{D}(\cdot\,|\,o,s)\right)\right]\leq 2\log(N_{\rm p}T_{\rm p}H|\mathcal{F}_{\gamma}|/\delta)/(N_{\rm p}T_{\rm p}H),
\tag{D.1}
\]

where $\mathbb{D}(\cdot\,|\,o,s)=\mathbb{D}_{\gamma^{*}}(\cdot\,|\,o,s)$ with $f_{\gamma^{*}}=f^{*}\in\mathcal{F}_{\gamma}$ denotes the ground-truth discriminator, which lies in the function class by the realizability in Assumption 5.4. Based on the definition of the total variation distance, it holds that

\[
\begin{aligned}
D_{\rm TV}^{2}&\left(\mathbb{D}_{\widehat{\gamma}}(\cdot\,|\,o,s),\mathbb{D}(\cdot\,|\,o,s)\right)\\
&=\left(\frac{f_{\widehat{\gamma}}(o,s)-f^{*}(o,s)}{(1+f_{\widehat{\gamma}}(o,s))(1+f^{*}(o,s))}\right)^{2}\geq\frac{1}{(1+R_{\mathcal{F}})^{2}}\left(\frac{f_{\widehat{\gamma}}(o,s)-f^{*}(o,s)}{1+f^{*}(o,s)}\right)^{2}\\
&=\frac{1}{(1+R_{\mathcal{F}})^{2}}\left(\frac{\mathbb{O}_{\widehat{\gamma}}(o\,|\,s)-\mathbb{O}(o\,|\,s)}{\mathcal{P}^{-}(o)+\mathbb{O}(o\,|\,s)}\right)^{2}=\frac{1}{(1+R_{\mathcal{F}})^{2}}\left(\frac{\bar{\mathbb{O}}_{\widehat{\gamma}}(o\,|\,s)-\bar{\mathbb{O}}(o\,|\,s)}{\bar{\mathbb{O}}(o\,|\,s)}\right)^{2},
\end{aligned}
\tag{D.2}
\]

where the first inequality results from $\|f\|_{\infty}\leq R_{\mathcal{F}}$ for all $f\in\mathcal{F}_{\gamma}$, the third equality arises from the definition $\mathbb{O}_{\gamma}(\cdot\,|\,s)=f_{\gamma}(\cdot,s)\cdot\mathcal{P}^{-}(\cdot)$, and we write $\bar{\mathbb{O}}(\cdot\,|\,s)=\frac{1}{2}\left(\mathbb{O}(\cdot\,|\,s)+\mathcal{P}^{-}(\cdot)\right)$ and $\bar{\mathbb{O}}_{\gamma}(\cdot\,|\,s)=\frac{1}{2}\left(\mathbb{O}_{\gamma}(\cdot\,|\,s)+\mathcal{P}^{-}(\cdot)\right)$. Moreover, $\bar{\mathbb{O}}(\cdot\,|\,s)$ represents the marginal distribution derived from the joint distribution $\mathbb{P}_{\mathcal{C}}$ of the collected dataset $\mathcal{D}_{\mathtt{Rep}}$ (see the data collection process in §3.2), as follows:

\[
\begin{aligned}
\mathbb{P}_{\mathcal{C}}(o\,|\,s)&=\mathbb{P}_{\mathcal{C}}(o\,|\,s,y=0)\cdot\mathbb{P}_{\mathcal{C}}(y=0\,|\,s)+\mathbb{P}_{\mathcal{C}}(o\,|\,s,y=1)\cdot\mathbb{P}_{\mathcal{C}}(y=1\,|\,s)\\
&=\mathbb{P}_{\mathcal{C}}(o\,|\,s,y=0)\cdot\mathbb{P}_{\mathcal{C}}(y=0)+\mathbb{P}_{\mathcal{C}}(o\,|\,s,y=1)\cdot\mathbb{P}_{\mathcal{C}}(y=1):=\bar{\mathbb{O}}(o\,|\,s),
\end{aligned}
\tag{D.3}
\]

where the second equality results from the fact that the contrastive data are labeled independently of the data itself, so that $\mathbb{P}_{\mathcal{C}}(s\,|\,y)=\mathbb{P}_{\mathcal{C}}(s)$ for all $y\in\{0,1\}$. Based on (D.3), we obtain

\[
\bar{\mathbb{E}}_{(o,s)\sim\mathcal{D}_{\mathtt{Rep}}}\left[\left(\frac{\bar{\mathbb{O}}_{\widehat{\gamma}}(o\,|\,s)-\bar{\mathbb{O}}(o\,|\,s)}{\bar{\mathbb{O}}(o\,|\,s)}\right)^{2}\right]=\bar{\mathbb{E}}_{s\sim\mathcal{D}_{\mathtt{Rep}}}\left[\mathbb{E}_{o\sim\bar{\mathbb{O}}(\cdot\,|\,s)}\left[\left(\frac{\bar{\mathbb{O}}_{\widehat{\gamma}}(o\,|\,s)-\bar{\mathbb{O}}(o\,|\,s)}{\bar{\mathbb{O}}(o\,|\,s)}\right)^{2}\right]\right],
\tag{D.4}
\]

where the equality results from the fact that $\mathbb{P}_{\mathcal{C}}(o,s)=\bar{\mathbb{O}}(o\,|\,s)\cdot\mathbb{P}_{\mathcal{C}}(s)$ and the definition of the $\chi^{2}$-divergence. Therefore, combining (D.2) and (D.4), it holds that

\[
\bar{\mathbb{E}}_{(o,s)\sim\mathcal{D}_{\mathtt{Rep}}}\left[D_{\rm TV}^{2}\left(\mathbb{D}_{\widehat{\gamma}}(\cdot\,|\,o,s),\mathbb{D}(\cdot\,|\,o,s)\right)\right]\geq\frac{1}{(1+R_{\mathcal{F}})^{2}}\cdot\bar{\mathbb{E}}_{s\sim\mathcal{D}_{\mathtt{Rep}}}\left[\chi^{2}\left(\bar{\mathbb{O}}_{\widehat{\gamma}}(\cdot\,|\,s)\,\|\,\bar{\mathbb{O}}(\cdot\,|\,s)\right)\right].
\tag{D.5}
\]
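The factor $(1+R_{\mathcal{F}})^{-2}$ can be sanity-checked numerically: for bounded ratios $f_{\widehat{\gamma}},f^{*}\in[0,R_{\mathcal{F}}]$, the squared TV distance between the induced binary discriminators dominates $(1+R_{\mathcal{F}})^{-2}\cdot\big((f_{\widehat{\gamma}}-f^{*})/(1+f^{*})\big)^{2}$, since $1+f_{\widehat{\gamma}}\leq 1+R_{\mathcal{F}}$. A minimal sketch with randomly drawn values (the bound $R$ and the draws are illustrative):

```python
import random

random.seed(0)
R = 5.0  # stands in for R_F, the sup-norm bound on the density ratio f

def tv_binary(f_hat, f_star):
    # TV distance between Bernoulli(f_hat/(1+f_hat)) and Bernoulli(f*/(1+f*)),
    # i.e., between the two induced binary discriminators.
    return abs(f_hat / (1 + f_hat) - f_star / (1 + f_star))

for _ in range(1000):
    f_hat = random.uniform(0.0, R)
    f_star = random.uniform(0.0, R)
    lhs = tv_binary(f_hat, f_star) ** 2
    rhs = ((f_hat - f_star) / (1 + f_star)) ** 2 / (1 + R) ** 2
    # TV^2 >= (1+R)^{-2} * ((f_hat - f*)/(1 + f*))^2 since 1 + f_hat <= 1 + R.
    assert lhs >= rhs - 1e-12
```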

Based on the variational representation of the $f$-divergence (§7.13, Polyanskiy and Wu, 2022), we have

\[
\begin{aligned}
\chi^{2}\left(\bar{\mathbb{O}}_{\widehat{\gamma}}(\cdot\,|\,s)\,\|\,\bar{\mathbb{O}}(\cdot\,|\,s)\right)&=\sup_{g:\mathcal{O}\mapsto\mathbb{R}}\left\{\frac{\left(\mathbb{E}_{\bar{\mathbb{O}}_{\widehat{\gamma}}}[g(o)\,|\,s]-\mathbb{E}_{\bar{\mathbb{O}}}[g(o)\,|\,s]\right)^{2}}{{\rm Var}_{\bar{\mathbb{O}}}[g(o)\,|\,s]}\right\}\\
&=\sup_{g:\mathcal{O}\mapsto\mathbb{R}}\left\{\frac{\left(\mathbb{E}_{\mathbb{O}_{\widehat{\gamma}}}[g(o)\,|\,s]-\mathbb{E}_{\mathbb{O}}[g(o)\,|\,s]\right)^{2}}{4\cdot{\rm Var}_{\mathbb{O}}[g(o)\,|\,s]}\cdot\frac{{\rm Var}_{\mathbb{O}}[g(o)\,|\,s]}{{\rm Var}_{\bar{\mathbb{O}}}[g(o)\,|\,s]}\right\}\\
&\geq\sup_{\substack{g:\mathcal{O}\mapsto\mathbb{R},\\ \mathbb{E}_{\mathbb{O}}[g(o)\,|\,s]=0}}\left\{\frac{\left(\mathbb{E}_{\mathbb{O}_{\widehat{\gamma}}}[g(o)\,|\,s]-\mathbb{E}_{\mathbb{O}}[g(o)\,|\,s]\right)^{2}}{4\cdot{\rm Var}_{\mathbb{O}}[g(o)\,|\,s]}\cdot\frac{\mathbb{E}_{\mathbb{O}}[g(o)^{2}\,|\,s]}{\mathbb{E}_{\bar{\mathbb{O}}}[g(o)^{2}\,|\,s]}\right\},
\end{aligned}
\tag{D.6}
\]

where the second equality follows from the definitions of $\bar{\mathbb{O}}(\cdot\,|\,s)$ and $\bar{\mathbb{O}}_{\widehat{\gamma}}(\cdot\,|\,s)$, and the inequality results from restricting the supremum to functions with $\mathbb{E}_{\mathbb{O}}[g(o)\,|\,s]=0$, for which ${\rm Var}_{\mathbb{O}}[g(o)\,|\,s]=\mathbb{E}_{\mathbb{O}}[g(o)^{2}\,|\,s]$, together with ${\rm Var}_{\bar{\mathbb{O}}}[g(o)\,|\,s]\leq\mathbb{E}_{\bar{\mathbb{O}}}[g(o)^{2}\,|\,s]$. Furthermore, note that

\[
\frac{\mathbb{E}_{\mathbb{O}}[g(o)^{2}\,|\,s]}{\mathbb{E}_{\bar{\mathbb{O}}}[g(o)^{2}\,|\,s]}=2\left(1+\frac{\mathbb{E}_{\mathcal{P}^{-}}[g(o)^{2}\,|\,s]}{\mathbb{E}_{\mathbb{O}}[g(o)^{2}\,|\,s]}\right)^{-1}\geq 2\left(1+\left\|\frac{\mathcal{P}^{-}(\cdot)}{\mathbb{O}(\cdot\,|\,s)}\right\|_{\infty}\right)^{-1}\geq 2(1+B^{-}_{\mathcal{F}})^{-1}, \tag{D.7}
\]

as $\mathbb{O}(\cdot\,|\,s)/\mathcal{P}^{-}(\cdot)=f^{*}\in\mathcal{F}_{\gamma}$ and $\|1/f\|_{\infty}\leq B^{-}_{\mathcal{F}}$ for all $f\in\mathcal{F}$ under the realizability in Assumption 5.4. Besides, it holds that

\begin{align*}
\sup_{\substack{g:\mathcal{O}\mapsto\mathbb{R},\\ \mathbb{E}_{\mathbb{O}}[g(o)\,|\,s]=0}}\left\{\frac{\left(\mathbb{E}_{\mathbb{O}_{\widehat{\gamma}}}[g(o)\,|\,s]-\mathbb{E}_{\mathbb{O}}[g(o)\,|\,s]\right)^{2}}{{\rm Var}_{\mathbb{O}}[g(o)\,|\,s]}\right\}&=\sup_{g:\mathcal{O}\mapsto\mathbb{R}}\left\{\frac{\left(\mathbb{E}_{\mathbb{O}_{\widehat{\gamma}}}[g(o)\,|\,s]-\mathbb{E}_{\mathbb{O}}[g(o)\,|\,s]\right)^{2}}{{\rm Var}_{\mathbb{O}}[g(o)\,|\,s]}\right\}\\
&=\chi^{2}\left(\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,s)\,\big\|\,\mathbb{O}(\cdot\,|\,s)\right). \tag{D.8}
\end{align*}
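The first equality in (D.7) is an algebraic identity when $\bar{\mathbb{O}}(\cdot\,|\,s)$ is the uniform mixture $\frac{1}{2}\left(\mathbb{O}(\cdot\,|\,s)+\mathcal{P}^{-}(\cdot)\right)$, which is the reading consistent with the factor of $2$ above; a quick numerical check of this assumed mixture form (illustrative only, with arbitrary alphabet size and seed) is:

```python
import numpy as np

rng = np.random.default_rng(0)

o = rng.dirichlet(np.ones(6))      # observation distribution O(. | s)
p_neg = rng.dirichlet(np.ones(6))  # negative-sampling distribution P^-
o_bar = 0.5 * (o + p_neg)          # assumed form: O_bar = (O + P^-) / 2
g = rng.standard_normal(6)         # an arbitrary test function g on O

# E_O[g^2] / E_Obar[g^2] = 2 * (1 + E_{P^-}[g^2] / E_O[g^2])^{-1}
lhs = (o * g**2).sum() / (o_bar * g**2).sum()
rhs = 2.0 / (1.0 + (p_neg * g**2).sum() / (o * g**2).sum())
assert abs(lhs - rhs) < 1e-9
```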

where the first equality holds since both the numerator and the variance are invariant to shifting $g$ by a constant, and the second equality is again the variational representation of the $\chi^{2}$-divergence. Based on (D.1), (D.5), (D.6), (D.7), and (D.8), we have

\[
\bar{\mathbb{E}}_{\mathcal{D}_{\mathtt{Rep}}}\left[\chi^{2}\left(\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,s)\,\big\|\,\mathbb{O}(\cdot\,|\,s)\right)\right]\leq\mathcal{O}\left(\frac{(1+B^{-}_{\mathcal{F}})(1+B_{\mathcal{F}})^{2}}{N_{\rm p}T_{\rm p}H}\cdot\log\left(N_{\rm p}T_{\rm p}H|\mathcal{F}|/\delta\right)\right). \tag{D.9}
\]

Combining (D.9) and the divergence inequalities (§7.6, Polyanskiy and Wu, 2022), we have

\begin{align*}
\bar{\mathbb{E}}_{\mathcal{D}_{\mathtt{Rep}}}\left[D_{\rm TV}\left(\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,s)\,\big\|\,\mathbb{O}(\cdot\,|\,s)\right)\right]&\leq\frac{1}{2}\cdot\bar{\mathbb{E}}_{\mathcal{D}_{\mathtt{Rep}}}\left[\sqrt{\chi^{2}\left(\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,s)\,\big\|\,\mathbb{O}(\cdot\,|\,s)\right)}\right]\\
&\leq\frac{1}{2}\cdot\sqrt{\bar{\mathbb{E}}_{\mathcal{D}_{\mathtt{Rep}}}\left[\chi^{2}\left(\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,s)\,\big\|\,\mathbb{O}(\cdot\,|\,s)\right)\right]}\leq\mathcal{O}\left(\frac{B_{\mathcal{F}}(B^{-}_{\mathcal{F}})^{1/2}}{(N_{\rm p}T_{\rm p}H)^{1/2}}\cdot\sqrt{\log\left(N_{\rm p}T_{\rm p}H|\mathcal{F}_{\gamma}|/\delta\right)}\right),
\end{align*}

where the second inequality follows from $\mathbb{E}[X]\leq\sqrt{\mathbb{E}[X^{2}]}$. This finishes the proof of Theorem 5.5. $\Box$
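As a numerical sanity check (illustrative only, not part of the proof), the two facts used in the last display, namely $D_{\rm TV}\leq\frac{1}{2}\sqrt{\chi^{2}}$ and $\mathbb{E}[X]\leq\sqrt{\mathbb{E}[X^{2}]}$, can be verified on random discrete distributions; the alphabet size and seed below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    # total variation distance between two discrete distributions
    return 0.5 * np.abs(p - q).sum()

def chi2(p, q):
    # chi-squared divergence chi^2(p || q); assumes q > 0 everywhere
    return ((p - q) ** 2 / q).sum()

for _ in range(1000):
    p, q = rng.dirichlet(np.ones(8)), rng.dirichlet(np.ones(8))
    # D_TV(p, q) <= (1/2) * sqrt(chi^2(p || q))
    assert tv(p, q) <= 0.5 * np.sqrt(chi2(p, q)) + 1e-12

# E[X] <= sqrt(E[X^2]) for a nonnegative sample
x = rng.random(100)
assert x.mean() <= np.sqrt((x ** 2).mean()) + 1e-12
```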

D.2 Proof of Theorem 5.7

Notations.

Denote by $(\mathcal{J},\widehat{\mathcal{J}})$, $(\pi_{z}^{*},\widehat{\pi}_{z}^{*})$, and $(\mathbb{P}_{z,h},\widehat{\mathbb{P}}_{z,h})$ the value functions, optimal policies, and probability distributions under the environments with the ground-truth $\mathbb{O}$ and the pretrained $\mathbb{O}_{\widehat{\gamma}}$, respectively. Furthermore, $(\pi^{t},\widehat{\pi}^{t})$ denote the Planner's policies empowered by the perfect $\mathtt{LLM}$ and the pretrained $\mathtt{LLM}_{\widehat{\theta}}$, respectively.
Proof of Theorem 5.7. Conditioned on the event $\mathcal{E}_{1}$ that both Theorem 2 and Theorem 5.5 hold, the regret under the practical setting can be decomposed as

\begin{align*}
{\rm Reg}_{z}(T)&\leq\underbrace{\sum_{t=1}^{T}\widehat{\mathcal{J}}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})-\mathcal{J}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})}_{\textbf{(i)}}+\underbrace{\sum_{t=1}^{T}\mathcal{J}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})-\mathcal{J}_{z}(\pi_{z}^{*},\omega^{t})}_{\textbf{(ii)}}\\
&\quad+\underbrace{\sum_{t=1}^{T}\mathcal{J}_{z}(\pi_{z}^{*},\omega^{t})-\widehat{\mathcal{J}}_{z}(\pi_{z}^{*},\omega^{t})}_{\textbf{(iii)}}+\underbrace{\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\widehat{\mathcal{J}}_{z}(\pi_{z}^{*},\omega^{t})-\widehat{\mathcal{J}}_{z}(\widehat{\pi}^{t},\omega^{t})\right]}_{\textbf{(iv)}}, \tag{D.10}
\end{align*}

where $\textbf{(ii)}\leq 0$ results from the optimality of $\pi_{z}^{*}$, i.e., $\mathcal{J}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})\leq\mathcal{J}_{z}(\pi_{z}^{*},\omega^{t})$ for all $t\in[T]$.

Step 1. Bound (i) and (iii) with Translator’s Pretraining Error.
For any policy sequence $\{\pi_{t}\}_{t\leq T}\subseteq\Pi$ and length $T\in\mathbb{N}$, based on the performance difference lemma (PDL) in Lemma F.4, we have

\begin{align*}
&\sum_{t=1}^{T}\widehat{\mathcal{J}}_{z}(\pi_{t},\omega^{t})-\mathcal{J}_{z}(\pi_{t},\omega^{t})\\
&\quad=\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t},g_{h}^{t})\sim\mathbb{P}_{z}^{\pi_{t}}}\left[\left(\mathbb{P}_{z,h}\widehat{V}_{h}^{\pi_{t}}-\widehat{\mathbb{P}}_{z,h}\widehat{V}_{h}^{\pi_{t}}\right)(s_{h}^{t},\tau_{h}^{t},g_{h}^{t},\omega^{t})\right]\\
&\quad\leq H\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t},g_{h}^{t})\sim\mathbb{P}_{z}^{\pi_{t}}}\left[D_{\rm TV}\left(\mathbb{P}_{z,h}(\cdot,\cdot\,|\,s_{h}^{t},\tau_{h}^{t},g_{h}^{t}),\widehat{\mathbb{P}}_{z,h}(\cdot,\cdot\,|\,s_{h}^{t},\tau_{h}^{t},g_{h}^{t})\right)\right]\\
&\quad\leq H\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{(s_{h}^{t},g_{h}^{t})\sim\mathbb{P}_{z}^{\pi_{t}}}\mathbb{E}_{s_{h+1}^{t}\sim\mathbb{P}_{z,h}(\cdot\,|\,s_{h}^{t},g_{h}^{t})}\left[D_{\rm TV}\left(\mathbb{O}(\cdot\,|\,s_{h+1}^{t}),\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,s_{h+1}^{t})\right)\right], \tag{D.11}
\end{align*}

where the last inequality results from the fact that for any f 𝑓 f -divergence, it holds that

\[
D_{f}\left(\mathbb{P}_{Y|X}\otimes\mathbb{P}_{X},\,\mathbb{Q}_{Y|X}\otimes\mathbb{P}_{X}\right)=\mathbb{E}_{X\sim\mathbb{P}_{X}}\left[D_{f}\left(\mathbb{P}_{Y|X},\mathbb{Q}_{Y|X}\right)\right].
\]
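This conditioning identity is easy to check numerically in the total-variation case; the sketch below (illustrative only, with arbitrary alphabet sizes and seed) builds two joint distributions sharing the marginal $\mathbb{P}_{X}$ and compares both sides:

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):
    # total variation distance between discrete distributions of equal shape
    return 0.5 * np.abs(p - q).sum()

p_x = rng.dirichlet(np.ones(3))            # shared marginal P_X over 3 values
p_yx = rng.dirichlet(np.ones(4), size=3)   # row i is P(. | x = i)
q_yx = rng.dirichlet(np.ones(4), size=3)   # row i is Q(. | x = i)

# joint distributions P_{Y|X} (x) P_X and Q_{Y|X} (x) P_X
p_joint = p_x[:, None] * p_yx
q_joint = p_x[:, None] * q_yx

lhs = tv(p_joint, q_joint)
rhs = sum(p_x[i] * tv(p_yx[i], q_yx[i]) for i in range(3))
assert abs(lhs - rhs) < 1e-9
```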

Based on (D.11), by taking the policies $\pi=\widehat{\pi}_{z}^{*}$ and $\pi=\pi_{z}^{*}$ respectively, we have

\begin{align*}
\textbf{(i)}+\textbf{(iii)}&=\sum_{t=1}^{T}\widehat{\mathcal{J}}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})-\mathcal{J}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})+\sum_{t=1}^{T}\mathcal{J}_{z}(\pi_{z}^{*},\omega^{t})-\widehat{\mathcal{J}}_{z}(\pi_{z}^{*},\omega^{t})\\
&\leq 2H^{2}T\cdot\max_{s\in\mathcal{S}}\left\{D_{\rm TV}\left(\mathbb{O}(\cdot\,|\,s),\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,s)\right)\right\}\leq 2H^{2}T\lambda_{R}^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta), \tag{D.12}
\end{align*}

where the last inequality results from Assumption 5.6 and Theorem 5.5 .
Step 2. Bound (iv) with LLM’s and Translator’s Pretraining Errors.

Recall that the Planner follows a mixture policy of $\pi_{\mathtt{exp}}$ and $\widehat{\pi}_{\mathtt{LLM}}$ given by

\[
\pi_{h}^{t}(\cdot\,|\,\tau_{h}^{t},\omega^{t})=(1-\epsilon)\cdot\widehat{\pi}_{h,\mathtt{LLM}}^{t}(\cdot\,|\,\tau_{h}^{t},\omega^{t})+\epsilon\cdot\pi_{h,\mathtt{exp}}(\cdot\,|\,\tau_{h}^{t}). \tag{D.13}
\]
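As an illustration of the $\epsilon$-greedy mixture in (D.13), the sketch below forms the mixed subgoal distribution over a small finite subgoal set and samples from it; the particular distributions and the value of $\epsilon$ are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_policy(pi_llm, pi_exp, eps):
    """Return the eps-greedy mixture (1 - eps) * pi_llm + eps * pi_exp."""
    return (1.0 - eps) * np.asarray(pi_llm) + eps * np.asarray(pi_exp)

pi_llm = np.array([0.7, 0.2, 0.1])  # hypothetical LLM-induced subgoal distribution
pi_exp = np.full(3, 1.0 / 3.0)      # uniform exploration policy over subgoals
probs = mixture_policy(pi_llm, pi_exp, eps=0.1)
assert np.isclose(probs.sum(), 1.0)  # the mixture is still a valid distribution
subgoal = rng.choice(3, p=probs)     # sample one subgoal index for this step
```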

Based on the PDL in Lemma F.4, the performance difference in term (iv) can be decomposed as

\begin{align*}
\textbf{(iv)}&=\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[\left(\pi^{*}_{z,h}-\widehat{\pi}_{h}^{t}\right)\widehat{Q}_{h}^{\pi_{z}^{*}}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\right]\\
&=\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[\left(\pi^{*}_{z,h}-\widehat{\pi}^{t}_{h,\mathtt{LLM}}\right)\widehat{Q}_{h}^{\pi_{z}^{*}}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\right]\cdot(1-\epsilon)\\
&\quad+\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[\left(\pi^{*}_{z,h}-\pi_{h,\mathtt{exp}}\right)\widehat{Q}_{h}^{\pi_{z}^{*}}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\right]\cdot\epsilon\\
&\leq H\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[D_{\rm TV}\left(\pi_{z,h}^{*}(\cdot\,|\,\tau_{h}^{t},\omega^{t}),\mathtt{LLM}_{\widehat{\theta}}(\cdot\,|\,\mathtt{pt}_{h}^{t})\right)\right]+HT\epsilon, \tag{D.14}
\end{align*}

where we write $\pi_{h}Q_{h}(s_{h},\tau_{h},\omega)=\langle\pi_{h}(\cdot\,|\,\tau_{h},\omega),Q_{h}(s_{h},\tau_{h},\cdot,\omega)\rangle_{\mathcal{G}}$ for all $h\in[H]$, and $\widehat{Q}_{h}^{\pi}$ denotes the action-value function under the practical setting. Furthermore, we have

\begin{align*}
&\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[D_{\rm TV}\left(\pi_{z,h}^{*}(\cdot\,|\,\tau_{h}^{t},\omega^{t}),\mathtt{LLM}_{\widehat{\theta}}(\cdot\,|\,\mathtt{pt}_{h}^{t})\right)\right]\\
&\quad\leq\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[D_{\rm TV}\left(\mathtt{LLM}_{\widehat{\theta}}(\cdot\,|\,\mathtt{pt}_{h}^{t}),\mathtt{LLM}(\cdot\,|\,\mathtt{pt}_{h}^{t})\right)\right]\\
&\qquad+\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[D_{\rm TV}\left(\pi_{z,h}^{*}(\cdot\,|\,\tau_{h}^{t},\omega^{t}),\mathtt{LLM}(\cdot\,|\,\mathtt{pt}_{h}^{t})\right)\right]\\
&\quad\leq\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[D_{\rm TV}\left(\mathtt{LLM}_{\widehat{\theta}}(\cdot\,|\,\mathtt{pt}_{h}^{t}),\mathtt{LLM}(\cdot\,|\,\mathtt{pt}_{h}^{t})\right)\right]\\
&\qquad+\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[\sum_{z^{\prime}\neq z}\mathbb{P}_{\mathcal{D}}(z^{\prime}\,|\,\mathtt{pt}_{h}^{t})\right], \tag{D.15}
\end{align*}

where the first inequality arises from the triangle inequality, and the second inequality results from Theorem 4.2. Furthermore, the first term can be bounded by the pretraining error as follows:

\begin{align*}
&\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[D_{\rm TV}\left(\mathtt{LLM}_{\widehat{\theta}}(\cdot\,|\,\mathtt{pt}_{h}^{t}),\mathtt{LLM}(\cdot\,|\,\mathtt{pt}_{h}^{t})\right)\right]\\
&\quad\leq\lambda_{S}\cdot\sum_{t=1}^{T}\sum_{h=1}^{H}\bar{\mathbb{E}}_{\mathtt{pt}_{h}^{t}\sim\mathcal{D}_{\mathtt{LLM}}}\left[D_{\rm TV}\left(\mathtt{LLM}_{\widehat{\theta}}(\cdot\,|\,\mathtt{pt}_{h}^{t}),\mathtt{LLM}(\cdot\,|\,\mathtt{pt}_{h}^{t})\right)\right]\\
&\quad=\lambda_{S}HT\cdot\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta), \tag{D.16}
\end{align*}

where the inequality follows from Theorem 2 and Assumption 5.6. Under the practical setting, $\mathtt{pt}_{h}^{t}$ is generated from the practical transition $\widehat{\mathbb{P}}_{z}$, which mismatches the distribution $\mathbb{P}_{\mathcal{D}}(z\,|\,\mathtt{pt}_{h}^{t})$ induced by pretraining. Let $\mathcal{X}^{t}_{\mathtt{exp}}=\{i\in[t]:\widehat{\pi}^{i}=\pi_{\mathtt{exp}}\}$, and write $\breve{\tau}_{h/t}^{i}=\tau_{H}^{i}$ for all $i<t$ and $\breve{\tau}_{h/t}^{t}=\tau_{h}^{t}$. Define the information gains as

\[
L_{h,t}^{\mathtt{exp}}(z^{\prime})=\sum_{i\in\mathcal{X}^{t}_{\mathtt{exp}}}\log\left(\frac{\mathbb{P}_{z^{\prime}}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}(\breve{\tau}_{h/t}^{i})}\right),\quad L_{h,t}^{\mathtt{LLM}}(z^{\prime})=\sum_{i\in[t]\backslash\mathcal{X}^{t}_{\mathtt{exp}}}\log\left(\frac{\mathbb{P}_{z^{\prime}}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}(\breve{\tau}_{h/t}^{i})}\right), \tag{D.17}
\]

where $\mathbb{P}_{z}(\tau_{h})$ is defined in (A.3). Based on Bayes' rule, we have

\[
\mathbb{P}_{\mathcal{D}}(z^{\prime}\,|\,\mathtt{pt}_{h}^{t})=\frac{\mathbb{P}_{z^{\prime}}(\mathtt{pt}_{h}^{t})\cdot\mathcal{P}_{\mathcal{Z}}(z^{\prime})}{\sum_{\tilde{z}\in\mathcal{Z}}\mathbb{P}_{\tilde{z}}(\mathtt{pt}_{h}^{t})\cdot\mathcal{P}_{\mathcal{Z}}(\tilde{z})}\leq\frac{\mathbb{P}_{z^{\prime}}(\mathtt{pt}_{h}^{t})}{\mathbb{P}_{z}(\mathtt{pt}_{h}^{t})}\cdot\frac{\mathcal{P}_{\mathcal{Z}}(z^{\prime})}{\mathcal{P}_{\mathcal{Z}}(z)}. \tag{D.18}
\]
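The inequality in (D.18) only uses the fact that the normalizing sum dominates its $\tilde{z}=z$ term; a small numerical check (with a hypothetical prior and arbitrary per-task likelihood values) is:

```python
import numpy as np

rng = np.random.default_rng(3)

prior = rng.dirichlet(np.ones(5))  # prior P_Z over 5 hidden tasks z
lik = rng.random(5)                # P_z(pt): likelihood of one prompt per task

# Bayes posterior over tasks given the prompt
posterior = lik * prior / (lik * prior).sum()

z = 0  # index of the ground-truth task
for zp in range(1, 5):
    # dropping all terms except z in the normalising sum yields the bound
    assert posterior[zp] <= (lik[zp] / lik[z]) * (prior[zp] / prior[z]) + 1e-12
```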

Let $\mathcal{E}_{2}$ be the event that Lemma D.1 holds. Based on (D.17), (D.18), and conditioned on the event $\mathcal{E}_{2}$, it holds that

\begin{align*}
\sum_{z^{\prime}\neq z}\mathbb{P}_{\mathcal{D}}(z^{\prime}\,|\,\mathtt{pt}_{h}^{t})&\leq\min\left\{\sum_{z^{\prime}\neq z}\frac{\mathbb{P}_{z^{\prime}}(\mathtt{pt}_{h}^{t})}{\mathbb{P}_{z}(\mathtt{pt}_{h}^{t})}\cdot\frac{\mathcal{P}_{\mathcal{Z}}(z^{\prime})}{\mathcal{P}_{\mathcal{Z}}(z)},1\right\}\\
&\leq\min\left\{c_{\mathcal{Z}}\sum_{z^{\prime}\neq z}\exp\left(L_{h,t}^{\mathtt{exp}}(z^{\prime})+L_{h,t}^{\mathtt{LLM}}(z^{\prime})\right),1\right\}\\
&\leq\min\left\{c_{\mathcal{Z}}\sum_{z^{\prime}\neq z}\exp\Big(t\cdot H\lambda^{-1}_{R}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}-2\eta|\mathcal{X}^{t}_{\mathtt{exp}}|+8\log(|\mathcal{Z}|/\delta)+2\eta\Big),1\right\}\\
&\leq\min\left\{c_{\mathcal{Z}}\sum_{z^{\prime}\neq z}\exp\Big(-\left(\eta\epsilon-H\lambda^{-1}_{R}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)t+8\log(|\mathcal{Z}|/\delta)+2\eta\Big),1\right\}\\
&\leq\min\left\{c_{\mathcal{Z}}\cdot\exp\Big(-\left(\eta\epsilon-H\lambda^{-1}_{R}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)t+9\log(|\mathcal{Z}|/\delta)+2\eta\Big),1\right\} \tag{D.19}
\end{align*}

for all $(h,t)\in[H]\times[T]$, where the second inequality follows from Assumption 4.5. Here, we suppose that $|\mathcal{X}^{t}_{\mathtt{exp}}|/t=\epsilon$ for simplicity, which is attainable if we explore at a fixed fraction of the episodes. Assume temporarily that $\eta\epsilon\geq H\lambda^{-1}_{R}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}$ holds. Following (D.19) and conditioned on the event $\mathcal{E}_{2}$, there exists a large constant $c_{0}>0$ such that

\[
\sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{z^{\prime}\neq z}\mathbb{P}_{\mathcal{D}}(z^{\prime}\,|\,\mathtt{pt}_{h}^{t})\leq c_{0}\cdot H\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\cdot\left(\eta\epsilon-H\lambda^{-1}_{R}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)^{-1}, \tag{D.20}
\]

where we use the fact that there exists a constant $c_{0}>0$ such that $\sum_{t=1}^{T}\min\{c_{3}\exp(-c_{1}t+c_{2}),1\}\leq c_{0}\cdot c_{1}^{-1}(c_{2}+\log c_{3})$ for $c_{1}\leq 1$. Furthermore, based on (D.20), we can show that
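The summation fact invoked here can be sanity-checked numerically for particular constants; in the sketch below the values of $c_{1},c_{2},c_{3}$ are arbitrary choices, and $c_{0}=5$ happens to suffice for them:

```python
import math

def capped_geometric_sum(T, c1, c2, c3):
    # sum_{t=1}^{T} min{ c3 * exp(-c1 * t + c2), 1 }
    return sum(min(c3 * math.exp(-c1 * t + c2), 1.0) for t in range(1, T + 1))

c1, c2, c3 = 0.5, 2.0, 3.0
s = capped_geometric_sum(10_000, c1, c2, c3)
bound = 5.0 / c1 * (c2 + math.log(c3))  # c0 = 5 is enough for these constants
assert s <= bound
```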

\begin{align*}
\sum_{t=1}^{T}\sum_{h=1}^{H}\,&\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[\sum_{z^{\prime}\neq z}\mathbb{P}_{\mathcal{D}}(z^{\prime}\,|\,\mathtt{pt}_{h}^{t})\right]\\
&\leq\sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{z^{\prime}\neq z}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{t}}}\left[\mathbb{P}_{\mathcal{D}}(z^{\prime}\,|\,\mathtt{pt}_{h}^{t})\operatorname{\mathds{1}}\left(\mathcal{E}_{2}\text{ holds}\right)\right]+2HT\delta\\
&\leq c_{0}\cdot H\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\cdot\left(\eta\epsilon-H\lambda^{-1}_{R}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)^{-1}+2HT\delta. \tag{D.21}
\end{align*}

Combining (D.14), (D.16), (D.19), and (D.21), it holds that

\begin{align*}
\textbf{(iv)}&\leq\underbrace{c_{0}\cdot H^{2}\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\cdot\left(\eta\epsilon-H\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)^{-1}}_{\textbf{(v)}}\\
&\qquad+\underbrace{HT\eta^{-1}\left(\eta\epsilon-H\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)}_{\textbf{(vi)}}+\lambda_{S}H^{2}T\cdot\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta)\\
&\qquad+H^{2}T(\eta\lambda_{R})^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}+2HT\delta. \tag{D.22}
\end{align*}

If we explore with probability $\epsilon=H(\eta\lambda_{R})^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}+\big(H\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)/T\eta\big)^{1/2}$, which satisfies the condition $\eta\epsilon\geq H\lambda^{-1}_{R}\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}$ assumed in (D.19), then we have

\[
\textbf{(v)}+\textbf{(vi)}\leq\mathcal{O}\left(H^{\frac{3}{2}}\sqrt{\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\cdot T/\eta}\right). \tag{D.23}
\]
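The balancing behind (D.23) can be checked numerically: plugging the chosen $\epsilon$ into terms (v) and (vi) makes both equal to $H^{3/2}\sqrt{LT/\eta}$ up to constants, where $L$ abbreviates $\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)$. The helper name `planning_terms` and the parameter values below are our own illustrative choices:

```python
import math

def planning_terms(H, T, eta, lam_R, Delta, L, c0=1.0):
    # epsilon as chosen in the text; L stands for log(c_Z |Z| / delta)
    eps = H / (eta * lam_R) * Delta**2 + math.sqrt(H * L / (T * eta))
    gap = eta * eps - H / lam_R * Delta**2   # equals sqrt(eta * H * L / T)
    v = c0 * H**2 * L / gap                  # term (v)
    vi = H * T / eta * gap                   # term (vi)
    return v + vi

# with this epsilon, (v) + (vi) = (c0 + 1) * H^{3/2} * sqrt(L * T / eta)
H, T, eta, lam_R, Delta, L = 5, 10_000, 0.3, 2.0, 0.01, 4.0
lhs = planning_terms(H, T, eta, lam_R, Delta, L)
rhs = 2.0 * H**1.5 * math.sqrt(L * T / eta)
assert abs(lhs - rhs) <= 1e-9 * rhs
```

This is exactly the usual square-root trade-off: (v) scales as $1/\sqrt{T}$-free inverse of the gap while (vi) scales linearly in it, so the chosen $\epsilon$ equalizes them.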

Step 3. Conclude the Proof based on Step 1 and Step 2.
Combining (D.10), (D.12), (D.22), and (D.23), the regret under the practical setting satisfies

\begin{align*}
{\rm Reg}_{z}(T)&\leq\textbf{(i)}+\textbf{(iii)}+\textbf{(iv)}+HT\cdot\mathbb{P}(\mathcal{E}_{1}\text{ fails})\\
&=\mathcal{O}\Big(\underbrace{H^{\frac{3}{2}}\sqrt{\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\cdot T/\eta}}_{\text{Planning error}}+\underbrace{H^{2}T\cdot\Delta_{\rm p}(N_{\rm p},T_{\rm p},H,\delta,\xi)}_{\text{Pretraining error}}\Big)+4HT\delta, \tag{D.24}
\end{align*}

where the cumulative pretraining error of the imperfectly pretrained PAR system is given by

\begin{align*}
\Delta_{\rm p}(N_{\rm p},T_{\rm p},H,\delta,\xi)&=(\eta\lambda_{R})^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\\
&\qquad+2\lambda_{R}^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)+\lambda_{S}\cdot\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta).
\end{align*}

Here, $\xi=(\eta,\lambda_{S},\lambda_{R})$ denotes the set of distinguishability and coverage coefficients in Definition 4.4 and Assumption 5.6, and $\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta)$ and $\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)$ are the pretraining errors defined in Theorem 2 and Theorem 5.5, respectively. By taking $\delta=1/\sqrt{T}$, we complete the proof of Theorem 5.7. $\Box$

D.3 Proof of Lemma D.1

In this subsection, we provide a detailed examination of posterior concentration when there exists a mismatch between the ground-truth environment and the pretrained environment.

Lemma D.1 .

Suppose that Assumption 4.5 and Theorem 5.5 hold. For all $(z^{\prime},h,t)\in\mathcal{Z}\times[H]\times[T]$, with probability at least $1-2\delta$, it holds that

\begin{align*}
&\text{(i).}\ L_{h,t}^{\mathtt{LLM}}(z^{\prime})\leq\left(t-|\mathcal{X}^{t}_{\mathtt{exp}}|\right)H\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}+4\log(|\mathcal{Z}|/\delta),\\
&\text{(ii).}\ L_{h,t}^{\mathtt{exp}}(z^{\prime})\leq|\mathcal{X}^{t}_{\mathtt{exp}}|\,H\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}+4\log(|\mathcal{Z}|/\delta)-2\eta\cdot|\mathcal{X}^{t}_{\mathtt{exp}}|+2\eta,
\end{align*}

where $L_{h,t}^{\mathtt{LLM}}(z^{\prime})$ and $L_{h,t}^{\mathtt{exp}}(z^{\prime})$ are the information gains defined in (D.17).

Proof of Lemma D.1. Let $\mathfrak{F}_{t}$ be the filtration induced by $\{\omega^{i},\tau_{H}^{i}\}_{i<t}\cup\{\operatorname{\mathds{1}}(\pi^{i}=\pi_{\texttt{exp}})\}_{i\in[t]}$. For a fixed tuple $(z^{\prime},h,t)\in\mathcal{Z}\times[H]\times[T]$, it holds that

\begin{align*}
\widehat{\mathbb{P}}_{z}&\left(L_{h,t}^{\mathtt{LLM}}(z^{\prime})\geq\beta_{h,t}^{\mathtt{LLM}}\right)\leq\inf_{\lambda\geq 0}\ \mathbb{E}_{\mathfrak{F}_{1:t}}\left[\exp\big(\lambda\cdot(L_{h,t}^{\mathtt{LLM}}(z^{\prime})-\beta_{h,t}^{\mathtt{LLM}})\big)\right]\\
&=\inf_{\lambda\geq 0}\ \mathbb{E}_{\bigotimes_{i\in[t]\backslash\mathcal{X}^{t}_{\mathtt{exp}}}\widehat{\mathbb{P}}_{z}^{\widehat{\pi}_{i}}}\left[\exp\left(\sum_{i\in[t]\backslash\mathcal{X}^{t}_{\mathtt{exp}}}\lambda\cdot\log\left(\frac{\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}\right)-\lambda\cdot\beta_{h,t}^{\mathtt{LLM}}\right)\right]\\
&=\inf_{\lambda\geq 0}\ \prod_{i\in[t]\backslash\mathcal{X}^{t}_{\mathtt{exp}}}\mathbb{E}_{\mathbb{P}_{z}^{\widehat{\pi}^{i}}}\left[\left(\frac{\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}\right)^{\lambda}\cdot\frac{\widehat{\mathbb{P}}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}\right]\cdot\exp\left(-\lambda\cdot\beta_{h,t}^{\mathtt{LLM}}\right)\\
&\leq\inf_{\lambda\geq 0}\ \prod_{i\in[t]\backslash\mathcal{X}^{t}_{\mathtt{exp}}}\mathbb{E}_{\mathbb{P}_{z}^{\widehat{\pi}^{i}}}\left[\left(\frac{\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}\right)^{2\lambda}\right]^{1/2}\mathbb{E}_{\mathbb{P}_{z}^{\widehat{\pi}^{i}}}\left[\left(\frac{\widehat{\mathbb{P}}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}\right)^{2}\right]^{1/2}\cdot\exp\left(-\lambda\cdot\beta_{h,t}^{\mathtt{LLM}}\right),
\end{align*}

where the first inequality is a natural corollary of Lemma F.1, and the last inequality follows from the Cauchy-Schwarz inequality. By taking $\lambda=\frac{1}{4}$, for all $(h,t)\in[H]\times[T]$, we have

\[
\mathbb{E}_{\mathbb{P}_{z}^{\widehat{\pi}^{i}}}\left[\left(\frac{\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}\right)^{1/2}\right]^{1/2}\mathbb{E}_{\mathbb{P}_{z}^{\widehat{\pi}^{i}}}\left[\left(\frac{\widehat{\mathbb{P}}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}{\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})}\right)^{2}\right]^{1/2}\leq\sqrt{1+\chi^{2}\big(\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\,\big\|\,\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)}. \tag{D.25}
\]

Based on Theorem 5.5 and Assumption 4.5, for any policy $\pi\in\Pi$, it holds that

\begin{align*}
1&+\chi^{2}\big(\mathbb{P}_{z^{\prime}}^{\pi}(\tau_{h})\,\big\|\,\widehat{\mathbb{P}}_{z}^{\pi}(\tau_{h})\big)\leq 1+\chi^{2}\big(\mathbb{P}_{z^{\prime}}^{\pi}(\tau_{h},s_{1:h})\,\big\|\,\widehat{\mathbb{P}}_{z}^{\pi}(\tau_{h},s_{1:h})\big)\\
&\leq 1+\chi^{2}\left(\prod_{h^{\prime}=1}^{h}\mathbb{P}_{z^{\prime}}^{\pi}(g_{h^{\prime}},s_{h^{\prime}+1}\,|\,\tau_{h^{\prime}},s_{h^{\prime}})\cdot\mathbb{O}(o_{h^{\prime}}\,|\,s_{h^{\prime}})\,\bigg\|\,\prod_{h^{\prime}=1}^{h}\mathbb{P}_{z^{\prime}}^{\pi}(g_{h^{\prime}},s_{h^{\prime}+1}\,|\,\tau_{h^{\prime}},s_{h^{\prime}})\cdot\mathbb{O}_{\widehat{\gamma}}(o_{h^{\prime}}\,|\,s_{h^{\prime}})\right)\\
&\leq\Big(1+\max_{s\in\mathcal{S}}\big\{\chi^{2}\big(\mathbb{O}(\cdot\,|\,s)\,\big\|\,\mathbb{O}_{\widehat{\gamma}}(\cdot\,|\,s)\big)\big\}\Big)^{H}\leq\left(1+\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)^{H}, \tag{D.26}
\end{align*}

where the first inequality follows from the data-processing inequality and the second inequality arises from the tensorization property of the $\chi^{2}$-divergence (Theorem 7.32 and §7.12, Polyanskiy and Wu, 2022). To ensure that $L_{h,t}^{\mathtt{LLM}}(z^{\prime})\leq\beta_{h,t}^{\mathtt{LLM}}$ holds for all $(z^{\prime},h,t)\in\mathcal{Z}\times[H]\times[T]$ with probability at least $1-\delta$, we let

\[
\prod_{i\in[t]\backslash\mathcal{X}^{t}_{\mathtt{exp}}}\sqrt{1+\chi^{2}\big(\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\,\big\|\,\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)}\cdot\exp\left(-\frac{\beta_{h,t}^{\mathtt{LLM}}}{4}\right)=\frac{\delta}{|\mathcal{Z}|},
\]

with a union bound taken over $\mathcal{Z}$, since Lemma F.1 ensures the inequality holds for all $(h,t)\in[H]\times[T]$. Thus, the constant $\beta_{h,t}^{\mathtt{LLM}}$ is chosen as

\begin{align*}
\beta_{h,t}^{\mathtt{LLM}}&=2\sum_{i\in[t]\backslash\mathcal{X}^{t}_{\mathtt{exp}}}\log\left(1+\chi^{2}\big(\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\,\big\|\,\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)\right)+4\log(|\mathcal{Z}|/\delta)\\
&\leq\left(t-|\mathcal{X}^{t}_{\mathtt{exp}}|\right)\cdot H\log\left(1+\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)+4\log(|\mathcal{Z}|/\delta)\\
&\leq\left(t-|\mathcal{X}^{t}_{\mathtt{exp}}|\right)\cdot H\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}+4\log(|\mathcal{Z}|/\delta),
\end{align*}

which is based on (D.25) and (D.26) together with a union bound over $\mathcal{Z}$; the last inequality results from $\log(1+x)\leq x$ for all $x\geq 0$. Similarly, for the exploration episodes, we have

\begin{align*}
\widehat{\mathbb{P}}_{z}&\left(L_{h,t}^{\mathtt{exp}}(z^{\prime})\geq\beta_{h,t}^{\mathtt{exp}}\right)\leq\inf_{\lambda\geq 0}\ \mathbb{E}\left[\exp\big(\lambda\cdot(L_{h,t}^{\mathtt{exp}}(z^{\prime})-\beta_{h,t}^{\mathtt{exp}})\big)\right]\\
&\leq\prod_{i\in\mathcal{X}^{t}_{\mathtt{exp}}}\sqrt{1-D^{2}_{\rm H}\big(\mathbb{P}^{\widehat{\pi}^{i}}_{z^{\prime}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)}\cdot\sqrt{1+\chi^{2}\big(\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\,\big\|\,\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)}\cdot\exp\left(-\frac{1}{4}\beta_{h,t}^{\mathtt{exp}}\right).
\end{align*}

Furthermore, based on Definition 4.4, the exploration episodes satisfy

\[
\sum_{i\in\mathcal{X}^{t}_{\mathtt{exp}}}D^{2}_{\rm H}\big(\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)\geq\sum_{i\in\mathcal{X}^{t-1}_{\mathtt{exp}}}D^{2}_{\rm H}\big(\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\tau_{H}),\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\tau_{H})\big)\geq\eta\cdot|\mathcal{X}^{t-1}_{\mathtt{exp}}|. \tag{D.27}
\]

To ensure that $L_{h,t}^{\mathtt{exp}}(z^{\prime})\leq\beta_{h,t}^{\mathtt{exp}}$ holds for all $(z^{\prime},h,t)\in\mathcal{Z}\times[H]\times[T]$ with high probability, we take

\[
\prod_{i\in\mathcal{X}^{t}_{\mathtt{exp}}}\sqrt{1-D^{2}_{\rm H}\big(\mathbb{P}^{\widehat{\pi}^{i}}_{z^{\prime}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)}\cdot\sqrt{1+\chi^{2}\big(\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\,\big\|\,\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)}\cdot\exp\left(-\frac{\beta_{h,t}^{\mathtt{exp}}}{4}\right)=\frac{\delta}{|\mathcal{Z}|},
\]

with a union bound taken over $\mathcal{Z}$, and thus the constant $\beta_{h,t}^{\mathtt{exp}}$ is chosen as

\begin{align*}
\beta_{h,t}^{\mathtt{exp}}&=2\sum_{i\in\mathcal{X}^{t}_{\mathtt{exp}}}\log\left(1-D^{2}_{\rm H}\big(\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i}),\mathbb{P}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)\right)\\
&\qquad+2\sum_{i\in\mathcal{X}^{t}_{\mathtt{exp}}}\log\left(1+\chi^{2}\big(\mathbb{P}_{z^{\prime}}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\,\big\|\,\widehat{\mathbb{P}}_{z}^{\widehat{\pi}^{i}}(\breve{\tau}_{h/t}^{i})\big)\right)+4\log(|\mathcal{Z}|/\delta)\\
&\leq|\mathcal{X}^{t}_{\mathtt{exp}}|\cdot H\log\left(1+\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)+4\log(|\mathcal{Z}|/\delta)-2\eta\cdot|\mathcal{X}^{t-1}_{\mathtt{exp}}|\\
&\leq|\mathcal{X}^{t}_{\mathtt{exp}}|\cdot H\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}+4\log(|\mathcal{Z}|/\delta)-2\eta\cdot(|\mathcal{X}^{t}_{\mathtt{exp}}|-1),
\end{align*}

where the first inequality results from (D.26) and (D.27) together with the facts that $\log(1-x)\leq-x$ for all $x\leq 1$ and $\log(1+x)\leq x$ for all $x\geq 0$. This completes the proof of Lemma D.1. $\Box$
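The two elementary logarithm bounds invoked in the last step can be spot-checked numerically; the grid below is purely illustrative:

```python
import math

# log(1+x) <= x for x >= 0, and log(1-x) <= -x for x < 1:
# the bounds used to collapse beta^{LLM}_{h,t} and beta^{exp}_{h,t} above
for k in range(1, 100):
    x = k / 100.0                     # grid over (0, 1)
    assert math.log(1.0 + x) <= x
    assert math.log(1.0 - x) <= -x
assert math.log(1.0 + 7.5) <= 7.5     # the first bound also holds for larger x
```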

D.4 Proof of Lemma D.2

Lemma D.2 (Learning Target of Contrastive Loss) .

For any observation-state pair $(o,s)\in\mathcal{O}\times\mathcal{S}$ sampled from the contrastive collection process, the learning target is $f^{*}(o,s)=\mathbb{O}(o\,|\,s)/\mathcal{P}^{-}(o)$.

Proof of Lemma D.2. For any $(o,s)\in\mathcal{O}\times\mathcal{S}$, the posterior probability of the label $y$ satisfies

\[
\mathbb{D}(y\,|\,o,s):=\mathbb{P}_{\mathcal{C}}(y\,|\,o,s)=\frac{\mathbb{P}_{\mathcal{C}}(o\,|\,s,y)\cdot\mathbb{P}_{\mathcal{C}}(s\,|\,y)}{\sum_{y^{\prime}\in\{0,1\}}\mathbb{P}_{\mathcal{C}}(o\,|\,s,y^{\prime})\cdot\mathbb{P}_{\mathcal{C}}(s\,|\,y^{\prime})},
\]

where the equation follows from Bayes' theorem and $\mathbb{P}_{\mathcal{C}}(y=0)=\mathbb{P}_{\mathcal{C}}(y=1)=1/2$. Moreover, the contrastive data collection process in §3.2 indicates that

\[
\mathbb{P}_{\mathcal{C}}(\cdot\,|\,s,y=1)=\mathbb{O}(\cdot\,|\,s),\quad\mathbb{P}_{\mathcal{C}}(\cdot\,|\,s,y=0)=\mathcal{P}^{-}(\cdot), \tag{D.28}
\]

and the labels are assigned independently of the sampled states, so that $\mathbb{P}_{\mathcal{C}}(s\,|\,y)=\mathbb{P}_{\mathcal{C}}(s)$. Thus, $\mathbb{P}_{\mathcal{C}}(y\,|\,o,s)=\mathbb{P}_{\mathcal{C}}(o\,|\,s,y)/(\mathcal{P}^{-}(o)+\mathbb{O}(o\,|\,s))$. Recall that the population risk is

\[
\mathcal{R}_{\rm CT}(\gamma;\mathcal{D}_{\mathtt{Rep}})=\mathbb{E}\left[D_{\rm KL}\big(\mathbb{D}_{\gamma}(\cdot\,|\,o,s)\,\big\|\,\mathbb{D}(\cdot\,|\,o,s)\big)+\mathrm{Ent}\big(\mathbb{D}(\cdot\,|\,o,s)\big)\right].
\]

The minimum is attained at $\mathbb{D}_{\gamma}(\cdot\,|\,o,s)=\mathbb{D}(\cdot\,|\,o,s)$. Following (5.1), the learning target satisfies

\[
\frac{\mathbb{P}_{\mathcal{C}}(o\,|\,s,y)}{\mathcal{P}^{-}(o)+\mathbb{O}(o\,|\,s)}=\left(\frac{f^{*}(o,s)}{1+f^{*}(o,s)}\right)^{y}\left(\frac{1}{1+f^{*}(o,s)}\right)^{1-y}. \tag{D.29}
\]

By solving the equation in (D.29), the learning target for the contrastive loss in (3.8) is $f^{*}(o,s)=\mathbb{O}(o\,|\,s)/\mathcal{P}^{-}(o)$, which concludes the proof of Lemma D.2. $\Box$
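A toy numeric check of this conclusion (all distributions below are illustrative choices of ours): under the convention that $y=1$ labels the true observation channel and $y=0$ the negative samples, with a fair label prior, the posterior odds of $y=1$ recover $f^{*}(o,s)=\mathbb{O}(o\,|\,s)/\mathcal{P}^{-}(o)$.

```python
O_given_s = {"a": 0.7, "b": 0.3}   # O(o|s) for one fixed state s (illustrative)
P_neg = {"a": 0.4, "b": 0.6}       # negative-sampling distribution P^-(o)

for o in ("a", "b"):
    # Bayes with a fair label prior: P(y=1 | o, s) = O(o|s) / (O(o|s) + P^-(o))
    p1 = O_given_s[o] / (O_given_s[o] + P_neg[o])
    f_star = p1 / (1.0 - p1)       # solve (D.29): f*/(1+f*) = P(y=1 | o, s)
    assert abs(f_star - O_given_s[o] / P_neg[o]) < 1e-12
```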

Appendix E Proof for Section B: Extensions

E.1 Proof of Proposition B.1

Proof of Proposition B.1 .

Based on the law of total probability, it holds that

\[
\mathbb{P}_{\mathcal{D}}\left(o_{h}\,|\,(o,g)_{1:h-1},\mathcal{H}_{t}\right)=\sum_{z\in\mathcal{Z}}\mathbb{P}_{z}\left(o_{h}\,|\,(o,g)_{1:h-1}\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,(o,g)_{1:h-1},\mathcal{H}_{t}\right). \tag{E.1}
\]

Furthermore, based on Bayes' theorem, we have

\[
\mathbb{P}_{\mathcal{D}}\left(z\,|\,(o,g)_{1:h-1},\mathcal{H}_{t}\right)=\frac{\prod_{h^{\prime}=1}^{h-2}\mathbb{P}_{z}\left(o_{h^{\prime}+1}\,|\,(o,g)_{1:h^{\prime}}\right)}{\prod_{h^{\prime}=1}^{h-2}\mathbb{P}_{\mathcal{D}}\left(o_{h^{\prime}+1}\,|\,(o,g)_{1:h^{\prime}},\mathcal{H}_{t}\right)}\cdot\mathbb{P}_{\mathcal{D}}(z\,|\,\mathcal{H}_{t}). \tag{E.2}
\]

Hence, (E.1) and (E.2) jointly indicate that

\begin{align*}
\prod_{h^{\prime}=1}^{h-1}\mathbb{P}_{\mathcal{D}}\left(o_{h^{\prime}+1}\,|\,(o,g)_{1:h^{\prime}},\mathcal{H}_{t}\right)&=\mathbb{P}_{\mathcal{D}}\left(o_{h}\,|\,(o,g)_{1:h-1},\mathcal{H}_{t}\right)\cdot\prod_{h^{\prime}=1}^{h-2}\mathbb{P}_{\mathcal{D}}\left(o_{h^{\prime}+1}\,|\,(o,g)_{1:h^{\prime}},\mathcal{H}_{t}\right)\\
&=\sum_{z\in\mathcal{Z}}\mathbb{P}_{z}\left(o_{h}\,|\,(o,g)_{1:h-1}\right)\cdot\prod_{h^{\prime}=1}^{h-2}\mathbb{P}_{z}\left(o_{h^{\prime}+1}\,|\,(o,g)_{1:h^{\prime}}\right)\cdot\mathbb{P}_{\mathcal{D}}(z\,|\,\mathcal{H}_{t})\\
&=\sum_{z\in\mathcal{Z}}\left(\prod_{h^{\prime}=1}^{h-1}\mathbb{P}_{z}\left(o_{h^{\prime}+1}\,|\,(o,g)_{1:h^{\prime}}\right)\right)\cdot\mathbb{P}_{\mathcal{D}}(z\,|\,\mathcal{H}_{t}). \tag{E.3}
\end{align*}

Following the definition of marginal distributions, it holds that

$$\begin{aligned}
\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right) &= \int_{o_{2:h-1}}\prod_{h'=1}^{h-1}\mathbb{P}_{\mathcal{D}}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right)\mathrm{d}o_{2:h-1}\\
&=\sum_{z\in\mathcal{Z}}\left(\int_{o_{2:h-1}}\prod_{h'=1}^{h-1}\mathbb{P}_{z}\left(o_{h'+1}\,|\,(o,g)_{1:h'}\right)\mathrm{d}o_{2:h-1}\right)\cdot\mathbb{P}_{\mathcal{D}}(z\,|\,\mathcal{H}_{t})\\
&=\sum_{z\in\mathcal{Z}}\mathbb{P}_{z}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right)\cdot\mathbb{P}_{\mathcal{D}}(z\,|\,\mathcal{H}_{t}),
\end{aligned}$$

where the second equality follows from (E.3). This completes the proof of Proposition B.1. ∎
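The Bayesian aggregation behind Proposition B.1 admits a small numerical sketch (all numbers and the two-task observation model below are illustrative assumptions, not the paper's setup): the posterior over latent variables $z$ is proportional to the product of per-step observation likelihoods, and the predictive distribution is the posterior-weighted mixture of the task-conditioned models, mirroring $\mathbb{P}_{\mathtt{LLM}}^{t}=\sum_{z}\mathbb{P}_{z}\cdot\mathbb{P}_{\mathcal{D}}(z\,|\,\mathcal{H}_{t})$.

```python
# Sketch of Bayesian aggregated prediction: the posterior over latent
# variables z is proportional to the product of observation likelihoods,
# and the prediction mixes the per-z models by this posterior.
# All numbers are illustrative, not from the paper.

def posterior_over_z(prior, likelihoods_per_z, history):
    """prior: dict z -> P(z); likelihoods_per_z: dict z -> (obs -> prob)."""
    weights = {}
    for z, p in prior.items():
        w = p
        for obs in history:
            w *= likelihoods_per_z[z](obs)  # product of likelihoods
        weights[z] = w
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

def aggregated_prediction(prior, likelihoods_per_z, history, obs):
    """P_LLM(obs | history) = sum_z P_z(obs) * P(z | history)."""
    post = posterior_over_z(prior, likelihoods_per_z, history)
    return sum(likelihoods_per_z[z](obs) * post[z] for z in post)

# Two latent tasks with different observation models over {0, 1}.
lik = {
    "z1": lambda o: 0.9 if o == 1 else 0.1,
    "z2": lambda o: 0.2 if o == 1 else 0.8,
}
prior = {"z1": 0.5, "z2": 0.5}

# After repeatedly observing o = 1, the posterior concentrates on z1,
# so the aggregated prediction approaches P_{z1}(o = 1) = 0.9.
print(aggregated_prediction(prior, lik, [1, 1, 1, 1], 1))
```

As the history grows, the posterior concentrates on the true latent task, which is the mechanism driving the posterior-concentration terms in the regret analysis below.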

E.2 Proof of Corollary B.3

Notations.

Denote by $(\mathcal{J},\widehat{\mathcal{J}})$, $(\pi_{z}^{*},\widehat{\pi}_{z}^{*})$, and $(\mathbb{P}_{z,h},\widehat{\mathbb{P}}_{z,h})$ the value functions, optimal policies, and probability measures under the environments with the ground-truth $\mathbb{O}$ and the pretrained $\mathbb{O}_{\widehat{\gamma}}$, respectively. Let $(\widehat{\mathcal{J}}_{t,\mathtt{LLM}},\widehat{\pi}_{\mathtt{LLM}}^{t,*})$ be the value function of the environment simulated by the pretrained $\mathtt{LLM}_{\widehat{\theta}}$ and its optimal policy, and let $\mathcal{J}_{t,\mathtt{LLM}}$ denote the value function of the environment simulated by the perfect $\mathtt{LLM}$. Finally, $(\mathbb{P}_{\mathtt{LLM}}^{t},\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t})$ denote the probability measures under the environments simulated by the perfect $\mathtt{LLM}$ and the pretrained $\mathtt{LLM}_{\widehat{\theta}}$, respectively.

Proof of Corollary B.3 .

Conditioned on the event $\mathcal{E}_{1}$ that both Theorem 2 and Theorem 5.5 hold, the regret under the practical setting can be decomposed as

$$\begin{aligned}
{\rm Reg}_{z}(T) &\leq \underbrace{\sum_{t=1}^{T}\widehat{\mathcal{J}}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})-\mathcal{J}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})}_{\textbf{(i)}}+\underbrace{\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\mathcal{J}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})-\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}_{z}^{*},\omega^{t})\right]}_{\textbf{(ii)}}\\
&\quad+\underbrace{\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}_{z}^{*},\omega^{t})-\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}^{t},\omega^{t})\right]}_{\textbf{(iii)}}\\
&\quad+\underbrace{\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}^{t},\omega^{t})-\mathcal{J}_{z}(\widehat{\pi}^{t},\omega^{t})\right]}_{\textbf{(iv)}}+\underbrace{\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\mathcal{J}_{z}(\widehat{\pi}^{t},\omega^{t})-\widehat{\mathcal{J}}_{z}(\widehat{\pi}^{t},\omega^{t})\right]}_{\textbf{(v)}}. \quad (E.4)
\end{aligned}$$

Step 1. Bound (i) and (v) with the Translator's Pretraining Error.
Similar to (D.11) in the proof of Theorem 5.7, it holds that

$$\textbf{(i)}+\textbf{(v)}\leq 2H^{2}T\lambda_{R}^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta), \quad (E.5)$$

following the pretraining error in Theorem 5.5 .
Step 2. Bound (iii) via Optimality in the Planner's Algorithm.
Recall that the Planner conducts task planning via the mixture policy:

$$\pi_{h}^{t}(\cdot\,|\,\tau_{h}^{t},\omega^{t})\sim(1-\epsilon)\cdot\widehat{\pi}_{h,\mathtt{LLM}}^{t,*}(\cdot\,|\,\tau_{h}^{t},\omega^{t})+\epsilon\cdot\pi_{h,\mathtt{exp}}(\cdot\,|\,\tau_{h}^{t}). \quad (E.6)$$

Following this, it holds that

$$\begin{aligned}
\textbf{(iii)} &= \sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}_{z}^{*},\omega^{t})-\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}_{\mathtt{LLM}}^{t,*},\omega^{t})\right]+\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}_{\mathtt{LLM}}^{t,*},\omega^{t})-\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}^{t},\omega^{t})\right]\\
&\leq \sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}_{\mathtt{LLM}}^{t,*},\omega^{t})-(1-\epsilon)\cdot\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}_{\mathtt{LLM}}^{t,*},\omega^{t})-\epsilon\cdot\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\pi_{\mathtt{exp}},\omega^{t})\right]\leq 2HT\epsilon, \quad (E.7)
\end{aligned}$$

where the first inequality results from the optimality of $\widehat{\pi}_{\mathtt{LLM}}^{t,*}$ under the simulated environment.
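The mixture policy in (E.6) can be sketched as follows; the interface is hypothetical, with `llm_policy` and `explore_policy` standing in for $\widehat{\pi}_{h,\mathtt{LLM}}^{t,*}$ and $\pi_{h,\mathtt{exp}}$: with probability $1-\epsilon$ the Planner follows the LLM-induced optimal policy, and with probability $\epsilon$ it follows the exploration policy.

```python
import random

def epsilon_greedy_subgoal(llm_policy, explore_policy, trajectory, epsilon, rng=random):
    """Sample a subgoal from the mixture (1 - eps) * pi_LLM + eps * pi_exp.

    llm_policy / explore_policy: callables mapping a trajectory to a subgoal.
    (Hypothetical interface used purely for illustration.)
    """
    if rng.random() < epsilon:
        return explore_policy(trajectory)   # exploration branch
    return llm_policy(trajectory)           # exploitation branch

# Toy policies: the LLM suggests "open_door", the explorer picks at random.
llm_policy = lambda traj: "open_door"
explore_policy = lambda traj: random.choice(["open_door", "pick_key", "move_left"])

# With epsilon = 0 the Planner always follows the LLM's subgoal.
print(epsilon_greedy_subgoal(llm_policy, explore_policy, [], epsilon=0.0))
```

The $\epsilon$ branch is exactly what makes term (iii) cost at most $2HT\epsilon$: with probability $\epsilon$ per episode the Planner deviates from the simulated-optimal policy, and each deviation costs at most $2H$ in value.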
Step 3. Bound (ii) and (iv) with the LLM's Pretraining Error.
For any policy $\pi\in\Pi$, given history $\mathcal{H}_{t}$, the performance difference satisfies

$$\begin{aligned}
\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\pi,\omega^{t})-\mathcal{J}_{z}(\pi,\omega^{t}) &= \widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\pi,\omega^{t})-\mathcal{J}_{t,\mathtt{LLM}}(\pi,\omega^{t})+\mathcal{J}_{t,\mathtt{LLM}}(\pi,\omega^{t})-\mathcal{J}_{z}(\pi,\omega^{t})\\
&\leq \underbrace{\mathbb{E}\left[\sum_{h=1}^{H}\int_{o_{h}}\left(\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right)-\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right)\right)\mathrm{d}o_{h}\right]}_{\textbf{(vi)}}\\
&\quad+\underbrace{\sup_{g_{1:H-1}}\sum_{h=1}^{H}\int_{o_{h}}\left(\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right)-\mathbb{P}_{z}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right)\right)\mathrm{d}o_{h}}_{\textbf{(vii)}},
\end{aligned}$$

where the inequality uses the fact that $\|r_{h}\|_{\infty}\leq 1$ and that $r_{h}$ depends solely on $o_{h}$. Furthermore, we have

$$\begin{aligned}
&\int_{o_{h}}\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right)-\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right)\mathrm{d}o_{h}\\
&\quad=\int_{o_{2:h}}\left(\prod_{h'=1}^{h-1}\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}\left(o_{h'+1}\,|\,(o,g)_{1:h'}\right)-\prod_{h'=1}^{h-1}\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{h'+1}\,|\,(o,g)_{1:h'}\right)\right)\mathrm{d}o_{2:h}. \quad (E.8)
\end{aligned}$$

By a telescoping argument, the difference of products can be decomposed as

$$\begin{aligned}
&\prod_{h'=1}^{h-1}\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}\left(o_{h'+1}\,|\,(o,g)_{1:h'}\right)-\prod_{h'=1}^{h-1}\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{h'+1}\,|\,(o,g)_{1:h'}\right)\\
&\quad=\sum_{h'=1}^{h-1}\left(\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}\left(o_{h'+1}\,|\,(o,g)_{1:h'}\right)-\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{h'+1}\,|\,(o,g)_{1:h'}\right)\right)\cdot\prod_{k=h'+1}^{h-1}\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}\left(o_{k+1}\,|\,(o,g)_{1:k}\right)\cdot\prod_{k=1}^{h'-1}\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{k+1}\,|\,(o,g)_{1:k}\right)\\
&\quad=\sum_{h'=1}^{h-1}\left(\mathtt{LLM}_{\widehat{\theta}}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right)-\mathtt{LLM}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right)\right)\cdot\prod_{k=h'+1}^{h-1}\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}\left(o_{k+1}\,|\,(o,g)_{1:k}\right)\cdot\prod_{k=1}^{h'-1}\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{k+1}\,|\,(o,g)_{1:k}\right). \quad (E.9)
\end{aligned}$$
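The decomposition in (E.9) is an instance of the standard telescoping identity for differences of products. Writing $a_{h'}=\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}(o_{h'+1}\,|\,(o,g)_{1:h'})$ and $b_{h'}=\mathbb{P}_{\mathtt{LLM}}^{t}(o_{h'+1}\,|\,(o,g)_{1:h'})$, it reads:

```latex
% Telescoping identity for a difference of products: each summand
% swaps exactly one factor b_{h'} for (a_{h'} - b_{h'}), keeping
% a-factors to its right and b-factors to its left; the cross terms
% cancel pairwise.
\prod_{h'=1}^{h-1} a_{h'} - \prod_{h'=1}^{h-1} b_{h'}
  = \sum_{h'=1}^{h-1} \left(a_{h'} - b_{h'}\right)
    \cdot \prod_{k=h'+1}^{h-1} a_{k}
    \cdot \prod_{k=1}^{h'-1} b_{k}.
```

For $h-1=2$, the right-hand side is $(a_{1}-b_{1})a_{2}+(a_{2}-b_{2})b_{1}=a_{1}a_{2}-b_{1}b_{2}$, as expected.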

Combining (E.8) and (E.9), it holds that

$$\begin{aligned}
\textbf{(vi)} &\leq \sum_{h=1}^{H}\int_{o_{2:h}}\sum_{h'=1}^{h-1}\left(\mathtt{LLM}_{\widehat{\theta}}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right)-\mathtt{LLM}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right)\right)\\
&\qquad\cdot\prod_{k=h'+1}^{h-1}\widehat{\mathbb{P}}_{\mathtt{LLM}}^{t}\left(o_{k+1}\,|\,(o,g)_{1:k}\right)\cdot\prod_{k=1}^{h'-1}\mathbb{P}_{\mathtt{LLM}}^{t}\left(o_{k+1}\,|\,(o,g)_{1:k}\right)\mathrm{d}o_{2:h}\\
&\leq \sum_{h=1}^{H}\sum_{h'=1}^{h-1}\mathbb{E}_{o_{1:h'}|\mathcal{H}_{t}}\left[D_{\rm TV}\left(\mathtt{LLM}_{\widehat{\theta}}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right),\mathtt{LLM}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right)\right)\right]. \quad (E.10)
\end{aligned}$$

Following (E.10), for any policy $\pi\in\Pi$, we have

$$\begin{aligned}
&\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\pi,\omega^{t})-\mathcal{J}_{t,\mathtt{LLM}}(\pi,\omega^{t})\right]\\
&\quad\leq\sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{h'=1}^{h-1}\mathbb{E}_{\mathcal{H}_{t}}\mathbb{E}_{(o,g)_{1:h'}|\mathcal{H}_{t}}\left[D_{\rm TV}\left(\mathtt{LLM}_{\widehat{\theta}}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right),\mathtt{LLM}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right)\right)\right]\\
&\quad\leq\sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{h'=1}^{h-1}\lambda_{S,1}\lambda_{S,2}^{-1}\cdot\bar{\mathbb{E}}_{\mathcal{D}_{\mathtt{LLM}}}\left[D_{\rm TV}\left(\mathtt{LLM}_{\widehat{\theta}}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right),\mathtt{LLM}\left(o_{h'+1}\,|\,(o,g)_{1:h'},\mathcal{H}_{t}\right)\right)\right]\\
&\quad\leq H^{2}T\lambda_{S,1}\lambda_{S,2}^{-1}\cdot\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta), \quad (E.11)
\end{aligned}$$

where the second inequality follows from Assumption B.2 and the last inequality follows from the pretraining error bound in Theorem 2. Based on Proposition B.1, the term (vii) can be upper bounded via the Bayesian aggregation argument:

$$\textbf{(vii)} = \sup_{g_{1:H-1}}\sum_{z'\neq z}\sum_{h=1}^{H}\int_{o_{h}}\left(\mathbb{P}_{z'}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right)-\mathbb{P}_{z}\left(o_{h}\,|\,o_{1},\mathbf{do}\;g_{1:h-1}\right)\right)\cdot\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathcal{H}_{t})\,\mathrm{d}o_{h}\leq H\sum_{z'\neq z}\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathcal{H}_{t}).$$

Following the arguments above, for any policy π Π 𝜋 Π \pi\in\Pi , it holds that

$$\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\mathcal{J}_{t,\mathtt{LLM}}(\pi,\omega^{t})-\mathcal{J}_{z}(\pi,\omega^{t})\right]\leq H\sum_{t=1}^{T}\sum_{z'\neq z}\mathbb{E}_{\mathcal{H}_{t}}\left[\mathbb{P}_{\mathcal{D}}(z'\,|\,\mathcal{H}_{t})\right]. \quad (E.12)$$

Combining (E.11), (E.12), and a concentration argument for the posterior probability similar to (D.20), under the event $\mathcal{E}_{2}$ (see the proof of Theorem 5.7 in §D.2), it holds that

$$\begin{aligned}
\textbf{(ii)}+\textbf{(iv)} &\leq \sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\left(\mathcal{J}_{z}(\widehat{\pi}_{z}^{*},\omega^{t})-\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}_{z}^{*},\omega^{t})\right)\cdot\operatorname{\mathds{1}}\left(\mathcal{E}_{2}\text{ holds}\right)\right]\\
&\quad+\sum_{t=1}^{T}\mathbb{E}_{\mathcal{H}_{t}}\left[\left(\widehat{\mathcal{J}}_{t,\mathtt{LLM}}(\widehat{\pi}^{t},\omega^{t})-\mathcal{J}_{z}(\widehat{\pi}^{t},\omega^{t})\right)\cdot\operatorname{\mathds{1}}\left(\mathcal{E}_{2}\text{ holds}\right)\right]+2HT\delta\\
&\leq 2H^{2}T\lambda_{S,1}\lambda_{S,2}^{-1}\cdot\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta)+2HT\delta\\
&\quad+c_{0}\cdot 2H\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\cdot\left(\eta\epsilon-H\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)^{-1}. \quad (E.13)
\end{aligned}$$

Step 4. Conclude the Proof based on Steps 1-3.
Combining (E.5), (E.7), and (E.13), we have

$$\begin{aligned}
{\rm Reg}_{z}(T) &\leq \underbrace{c_{0}\cdot 2H\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\cdot\left(\eta\epsilon-H\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)^{-1}}_{\textbf{(viii)}}+4HT\delta\\
&\quad+\underbrace{2HT\eta^{-1}\left(\eta\epsilon-H\lambda^{-1}_{R}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\right)}_{\textbf{(ix)}}+2H^{2}T\lambda_{S,1}\lambda_{S,2}^{-1}\cdot\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta)\\
&\quad+2H^{2}T(\eta\lambda_{R})^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}+2H^{2}T\lambda_{R}^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)\\
&\leq \mathcal{O}\Big{(}H\sqrt{\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\cdot T/\eta}+H^{2}T\cdot\Delta_{\rm p,wm}(N_{\rm p},T_{\rm p},H,\delta,\xi)\Big{)}+4HT\delta, \quad (E.14)
\end{aligned}$$

if we choose $\epsilon=(\log(c_{\mathcal{Z}}|\mathcal{Z}|\sqrt{T})/T\eta)^{1/2}+H(\eta\lambda_{R})^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}$ to strike an exploration-exploitation balance between (viii) and (ix). Thus, the cumulative pretraining error follows

$$\begin{aligned}
\Delta_{\rm p,wm}(N_{\rm p},T_{\rm p},H,\delta,\xi) &= 2(\eta\lambda_{R})^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)^{2}\\
&\quad+2\lambda_{R}^{-1}\cdot\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)+2\lambda_{S,1}\lambda_{S,2}^{-1}\cdot\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta).
\end{aligned}$$

Here, $\xi=(\eta,\lambda_{S,1},\lambda_{S,2},\lambda_{R})$ denotes the set of distinguishability and coverage coefficients in Definition 4.4 and Assumption 5.6, and $\Delta_{\mathtt{LLM}}(N_{\rm p},T_{\rm p},H,\delta)$ and $\Delta_{\mathtt{Rep}}(N_{\rm p},T_{\rm p},H,\delta)$ are the pretraining errors defined in Theorem 2 and Theorem 5.5. Taking $\delta=1/\sqrt{T}$ completes the proof. ∎

E.3 Proof of Corollary B.4

The proof is similar to that in §C.2.
Proof Sketch of Corollary B.4. We first verify the claim in (B.2), which is akin to Proposition 4.2. Note that for all $(h,t)\in[H]\times[T]$, by the law of total probability, it holds that

$$\begin{aligned}
\pi_{h,\mathtt{LLM}}^{t}\big{(}\mathbf{g}_{h}^{t}\,|\,\tau_{h}^{t},\omega^{t}\big{)} &= \prod_{k\in\mathcal{K}}\mathtt{LLM}\big{(}g_{h,k}^{t}\,|\,\mathtt{pt}_{h,k}^{t}\big{)}\\
&=\prod_{k\in\mathcal{K}}\left(\sum_{z\in\mathcal{Z}}\mathbb{P}\left(g_{h,k}^{t}\,|\,\mathtt{pt}_{h,k}^{t},z\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,\mathtt{pt}_{h,k}^{t}\right)\right)\\
&=\prod_{k\in\mathcal{K}}\left(\sum_{z\in\mathcal{Z}}\pi^{*}_{z,h,k}\left(g_{h,k}^{t}\,|\,\tau_{h}^{t},\omega^{t}\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z\,|\,\mathtt{pt}_{h}^{t}\right)\right), \quad (E.15)
\end{aligned}$$

where the first equality arises from the autoregressive generation of the LLM, and the last equality follows from the data-generating distribution. The Planner adopts a mixture of $\pi_{\mathtt{exp}}$ and $\pi_{\mathtt{LLM}}$ such that

$$\pi_{h}^{t}(\mathbf{g}_{h}^{t}\,|\,\tau_{h}^{t},\omega^{t})\sim(1-\epsilon)\cdot\pi_{h,\mathtt{LLM}}^{t}(\mathbf{g}_{h}^{t}\,|\,\tau_{h}^{t},\omega^{t})+\epsilon\cdot\pi_{h,\mathtt{exp}}(\mathbf{g}_{h}^{t}\,|\,\tau_{h}^{t}), \quad (E.16)$$

for any $(h,t)\in[H]\times[T]$, given an $\eta$-distinguishable policy $\pi_{\mathtt{exp}}$ (see Definition 4.4). Given a sequence of high-level tasks $\{\omega^{t}\}_{t\in[T]}$, the regret can be decomposed as

$$\text{Reg}(T) \leq \sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\right]+HT\epsilon. \quad (E.17)$$

Recall that (C.3) indicates that for all $(h,t)\in[H]\times[T]$, we have

$$\begin{aligned}
&\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)(\mathbf{g}_{h}\,|\,\tau_{h},\omega)\\
&\quad=\prod_{k\in\mathcal{K}}\left(\sum_{z'\in\mathcal{Z}}\pi^{*}_{z',h,k}\left(g_{h,k}\,|\,\tau_{h},\omega\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z'\,|\,\mathtt{pt}_{h}^{t}\right)\right)-\prod_{k\in\mathcal{K}}\pi_{z,h,k}^{*}(g_{h,k}\,|\,\tau_{h},\omega)\\
&\quad\leq H\sum_{k\in\mathcal{K}}\left(\sum_{z'\neq z}(\pi^{*}_{z',h,k}-\pi^{*}_{z,h,k})\left(g_{h,k}\,|\,\tau_{h},\omega\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z'\,|\,\mathtt{pt}_{h}^{t}\right)\right)\\
&\quad\qquad\cdot\prod_{k'=1}^{k-1}\left(\sum_{z''\in\mathcal{Z}}\pi^{*}_{z'',h,k'}\left(g_{h,k'}\,|\,\tau_{h},\omega\right)\cdot\mathbb{P}_{\mathcal{D}}\left(z''\,|\,\mathtt{pt}_{h}^{t}\right)\right)\cdot\prod_{k'=k+1}^{K}\pi_{z,h,k'}^{*}(g_{h,k'}\,|\,\tau_{h},\omega).
\end{aligned}$$

Following this, we have

\[
\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\leq HK\cdot\sum_{z^{\prime}\neq z}\mathbb{P}_{\mathcal{D}}(z^{\prime}\,|\,\mathtt{pt}_{h}^{t}), \tag{E.18}
\]

for all $(h,t)\in[H]\times[T]$. Based on Lemma C.1 and arguments similar to those in the proof of Theorem 4.6 in §C.2, with probability at least $1-\delta$, the following event $\mathcal{E}_{1}$ holds: for all $(h,t)\in[H]\times[T]$,

\[
\sum_{z^{\prime}\neq z}\mathbb{P}_{\mathcal{D}}(z^{\prime}\,|\,\mathtt{pt}_{h}^{t})\leq\mathcal{O}\left(\min\left\{\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)\,\eta^{-1}/|\mathcal{X}^{t-1}_{\mathtt{exp}}|,\,1\right\}\right), \tag{E.19}
\]

where $\mathcal{X}^{t}_{\mathtt{exp}}=\{i\in[t]:\pi^{i}=\pi_{\mathtt{exp}}\}$ denotes the set of exploration episodes. Based on (E.15), (E.18), and conditioned on $\mathcal{E}_{1}$, it holds that

\[
\begin{aligned}
&\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\right]\\
&\quad\leq 2\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)HK\eta^{-1}\cdot\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{\tau_{h}^{t}\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\min\left\{1/|\mathcal{X}^{t-1}_{\mathtt{exp}}|,1\right\}\right]. \tag{E.20}
\end{aligned}
\]

Note that $\mathds{1}(\pi^{t}=\pi_{\mathtt{exp}})\overset{\mathrm{iid}}{\sim}\mathrm{Bernoulli}(\epsilon)$ for all $t\in[T]$. Besides, with probability at least $1-\delta$, the following event $\mathcal{E}_{2}$ holds:

\[
\sum_{t=1}^{T}\min\left\{1/|\mathcal{X}^{t-1}_{\mathtt{exp}}|,1\right\}\leq\mathcal{O}\left(\epsilon^{-1}\log(T\log T/\delta)\right), \tag{E.21}
\]

based on Lemma F.5. Combining (E.17), (E.20), and (E.21), it follows that

\[
\begin{aligned}
\mathrm{Reg}_{z}(T)&\leq\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\,\mathds{1}\left(\mathcal{E}_{1}\cap\mathcal{E}_{2}\text{ holds}\right)\right]\\
&\quad+\sum_{t=1}^{T}\sum_{h=1}^{H}\mathbb{E}_{\mathcal{H}_{t}\sim\bigotimes_{i=1}^{t-1}\mathbb{P}_{z}^{\pi_{i}}}\mathbb{E}_{(s_{h}^{t},\tau_{h}^{t})\sim\mathbb{P}_{z}^{\pi^{t}}}\left[\left(\pi_{z,h}^{*}-\pi^{t}_{h,\mathtt{LLM}}\right)Q_{z,h}^{*}(s_{h}^{t},\tau_{h}^{t},\omega^{t})\,\mathds{1}\left(\mathcal{E}_{1}\cap\mathcal{E}_{2}\text{ fails}\right)\right]+HT\epsilon\\
&\leq\mathcal{O}\Big(\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)H^{2}K\log(T\log T/\delta)\cdot(\eta\epsilon)^{-1}+HT\epsilon+H\sqrt{T}\log(1/\delta)+2HT\delta\Big)\\
&\leq\tilde{\mathcal{O}}\left(H^{\frac{3}{2}}\sqrt{TK/\eta\cdot\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)}\right),
\end{aligned}
\]

where we choose the exploration probability $\epsilon=(HK\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)/(T\eta))^{1/2}$ in the last inequality. Taking $\delta=1/\sqrt{T}$ in the arguments above concludes the proof of Corollary B.4. $\Box$
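The choice of $\epsilon$ above balances the exploitation error term, which scales as $1/\epsilon$, against the exploration cost $HT\epsilon$. The following sketch (illustrative only; the values of $H$, $K$, $T$, $\eta$, and the log factor are hypothetical placeholders, not quantities from the paper) checks numerically that $\epsilon^{*}=(HK\log(c_{\mathcal{Z}}|\mathcal{Z}|/\delta)/(T\eta))^{1/2}$ minimizes the sum of these two dominant terms.

```python
import math

def regret_bound(eps, H, K, T, eta, log_term):
    """Dominant terms of the bound above: an exploitation error C/eps with
    C = H^2 * K * log_term / eta, plus an exploration cost H * T * eps."""
    C = H**2 * K * log_term / eta
    return C / eps + H * T * eps

# Hypothetical problem sizes (illustrative only).
H, K, T, eta, log_term = 10, 5, 10_000, 0.5, 3.0

# The balancing choice from the proof: eps* = sqrt(H*K*log_term / (T*eta)).
eps_star = math.sqrt(H * K * log_term / (T * eta))

# eps* should minimize the bound over a grid of rescaled alternatives,
# since C/eps + H*T*eps is minimized at eps = sqrt(C / (H*T)) = eps*.
grid = [eps_star * s for s in (0.25, 0.5, 1.0, 2.0, 4.0)]
best = min(grid, key=lambda e: regret_bound(e, H, K, T, eta, log_term))
assert best == eps_star
```

At the balancing point both terms equal $\sqrt{CHT}$, so the bound evaluates to $2\sqrt{CHT}$, matching the $\tilde{\mathcal{O}}(H^{3/2}\sqrt{TK/\eta\cdot\log(\cdot)})$ rate.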

Appendix F Technical Lemmas

Lemma F.1 (Martingale Concentration Inequality) .

Let $X_{1},\dots,X_{T}$ be a sequence of real-valued random variables adapted to a filtration $(\mathscr{F}_{t})_{t\leq T}$. For any $\delta\in(0,1)$ and $\lambda>0$, it holds that

\[
\mathbb{P}\left(\exists T^{\prime}\in[T]:-\sum_{t=1}^{T^{\prime}}X_{t}\geq\sum_{t=1}^{T^{\prime}}\frac{1}{\lambda}\log\mathbb{E}\left[\exp(-\lambda X_{t})\,\big|\,\mathscr{F}_{t-1}\right]+\frac{1}{\lambda}\log(1/\delta)\right)\leq\delta.
\]

Proof of Lemma F.1. See Lemma A.4 in Foster et al. (2021) and Theorem 13.2 in Zhang (2023) for detailed proofs; Lemma A.4 in Foster et al. (2021) is the special case $\lambda=1$. $\Box$

Lemma F.2 (Donsker-Varadhan) .

Let $P$ and $Q$ be probability measures over $\mathcal{X}$. Then

\[
D_{\mathrm{KL}}(P\,\|\,Q)=\sup_{f\in\mathcal{F}}\left\{\mathbb{E}_{x\sim P}[f(x)]-\log\mathbb{E}_{x\sim Q}[\exp(f(x))]\right\},
\]

where $\mathcal{F}=\{f:\mathcal{X}\mapsto\mathbb{R}\,|\,\mathbb{E}_{x\sim Q}[\exp(f(x))]<\infty\}$.

Proof of Lemma F.2. See Donsker and Varadhan (1976) for a detailed proof. $\Box$
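As a quick numerical illustration of Lemma F.2 (not part of the proof), the supremum in the variational formula is attained at $f^{*}=\log(\mathrm{d}P/\mathrm{d}Q)$. The sketch below verifies this on a toy three-point space; the probability values are hypothetical.

```python
import math

# Toy discrete distributions on a 3-point space (illustrative values).
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

# Direct KL divergence D_KL(P || Q).
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))

# Donsker-Varadhan objective E_P[f] - log E_Q[exp(f)] for a test function f.
def dv_objective(f):
    return sum(p * fx for p, fx in zip(P, f)) - math.log(
        sum(q * math.exp(fx) for q, fx in zip(Q, f))
    )

# The supremum is attained at f*(x) = log(dP/dQ)(x): the objective equals KL.
f_star = [math.log(p / q) for p, q in zip(P, Q)]
assert abs(dv_objective(f_star) - kl) < 1e-12

# Any other test function gives a value no larger than the KL divergence.
assert dv_objective([0.5, 0.0, -0.5]) <= kl
```

Plugging $f^{*}$ in makes $\mathbb{E}_{x\sim Q}[\exp(f^{*}(x))]=\sum_{x}P(x)=1$, so the log term vanishes and the objective reduces to $\mathbb{E}_{x\sim P}[\log(P(x)/Q(x))]=D_{\mathrm{KL}}(P\,\|\,Q)$.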

Lemma F.3 (MLE guarantee) .

Let $\mathcal{F}$ be a finite function class, and suppose there exists $f^{*}\in\mathcal{F}$ such that $f^{*}(x,y)=\mathbb{P}(y\,|\,x)$, where $\mathbb{P}(y\,|\,x)$ is the conditional distribution to be estimated. Given a dataset $\mathcal{D}=\{x_{i},y_{i}\}_{i\in[N]}$ where $x_{i}\sim\mathbb{P}_{\mathcal{D}}(\cdot\,|\,x_{1:i-1},y_{1:i-1})$ and $y_{i}\sim\mathbb{P}_{\mathcal{D}}(\cdot\,|\,x_{i})$ for all $i\in[N]$, we have

\[
\bar{\mathbb{E}}_{\mathcal{D}}\left[D_{\mathrm{TV}}^{2}\left(\widehat{f}(x,\cdot),f^{*}(x,\cdot)\right)\right]\leq 2\log(N|\mathcal{F}|/\delta)/N
\]

with probability at least $1-\delta$, where $\widehat{f}$ is the maximum likelihood estimator such that

\[
\widehat{f}:=\operatorname*{argmax}_{f\in\mathcal{F}}\ \widehat{\mathbb{E}}_{\mathcal{D}}\left[\log f(x,y)\right].
\]

Proof of Lemma F.3. See Theorem 21 in Agarwal et al. (2020) for a detailed proof. $\Box$

Lemma F.4 (Performance Difference Lemma for POMDP) .

Consider policies $\pi,\pi^{\prime}\in\Pi$. It holds that

\[
\mathcal{J}(\pi)-\mathcal{J}(\pi^{\prime})=\sum_{h=1}^{H}\mathbb{E}_{\pi}\left[Q_{h}^{\pi^{\prime}}(s_{h},\tau_{h},g_{h})-V_{h}^{\pi^{\prime}}(s_{h},\tau_{h})\right].
\]

For a fixed policy $\pi\in\Pi$ under two different POMDPs, denoted by $\mathcal{M}$ and $\mathcal{M}^{\prime}$, it holds that

\[
\mathcal{J}_{\mathcal{M}}(\pi)-\mathcal{J}_{\mathcal{M}^{\prime}}(\pi)=\sum_{h=1}^{H}\mathbb{E}_{\mathcal{M}}^{\pi}\left[\left(\mathbb{P}_{h,\mathcal{M}}V_{h+1,\mathcal{M}^{\prime}}^{\pi}-\mathbb{P}_{h,\mathcal{M}^{\prime}}V_{h+1,\mathcal{M}^{\prime}}^{\pi}\right)(s_{h},\tau_{h},g_{h})\right],
\]

where $\mathbb{P}_{h,\mathcal{M}}V_{h+1,\mathcal{M}^{\prime}}^{\pi}(s_{h},\tau_{h},g_{h})=\langle V_{h+1,\mathcal{M}^{\prime}}^{\pi}(\cdot,\cdot),\mathbb{P}_{h,\mathcal{M}}(\cdot,\cdot\,|\,s_{h},\tau_{h},g_{h})\rangle_{\mathcal{S}\times\mathcal{T}^{*}}$.

Lemma F.5 .

Let $X_{t}\overset{\mathrm{iid}}{\sim}\mathrm{Bernoulli}(\rho)$ and $Y_{t}=\sum_{\tau=1}^{t}X_{\tau}$. For any $\delta\in(0,1)$ and $\rho>0$, with probability at least $1-\delta$, it holds that $\sum_{t=1}^{T}\min\{1/Y_{t},1\}\leq\mathcal{O}(\rho^{-1}\log(T\log T/\delta))$.

Proof of Lemma F.5 .

Note that $\{Y_{t}\}_{t\in[T]}$ is non-decreasing, and it holds that

\[
\sum_{t=1}^{T}\min\left\{\frac{1}{Y_{t}},1\right\}=\#\{t\in[T]:Y_{t}=0\}+\sum_{t\in[T]:Y_{t}>0}\frac{1}{Y_{t}}, \tag{F.1}
\]

and with probability at least $1-\delta$, the following event $\mathcal{E}_{0}$ holds:

\[
t_{0}:=\#\{t\in[T]:Y_{t}=0\}\leq\frac{\log(\delta)}{\log(1-\rho)}\leq\rho^{-1}\log(1/\delta),
\]

where the first inequality results from the tail property of the Bernoulli random variable (the first success occurs after time $t_{0}$ with probability $(1-\rho)^{t_{0}}$), and the second inequality uses the fact that $\log(1-x)\leq-x$ for all $x\leq 1$. For notational simplicity, we write $\{t\in[T]:Y_{t}>0\}=\{t_{0}+1,\dots,T\}$ and set $N_{T}=\lceil\log_{2}T\rceil$. With probability at least $1-\delta$, the following event $\mathcal{E}_{n}$ holds:

\[
Y_{t_{0}+2^{n}}=\sum_{\tau=1}^{t_{0}+2^{n}}X_{\tau}=\sum_{\tau=t_{0}+1}^{t_{0}+2^{n}}X_{\tau}\geq 2^{n}\rho-\sqrt{2^{n-1}\log(1/\delta)}, \tag{F.2}
\]

based on Hoeffding's inequality. Suppose that the events $\{\mathcal{E}_{n}\}_{0\leq n\leq N_{T}}$ all hold; then we have

\[
\sum_{t\in[T]:Y_{t}>0}\frac{1}{Y_{t}}=\sum_{n=0}^{N_{T}}\sum_{t=t_{0}+2^{n}}^{t_{0}+2^{n+1}-1}\frac{1}{Y_{t}}\leq\sum_{n=0}^{N_{T}}\frac{2^{n}}{Y_{t_{0}+2^{n}}}\leq\sum_{n=0}^{N_{T}}\frac{2^{n}}{\max\left\{2^{n}\rho-\sqrt{2^{n-1}\log(1/\delta)},1\right\}}. \tag{F.3}
\]

Let $n_{0}=1+\lceil\log_{2}(\rho^{-2}\log(1/\delta))\rceil$ so that $\rho-\sqrt{\log(1/\delta)/2^{n+1}}\geq\rho/2$ for all $n\geq n_{0}$. Following (F.3), it holds that

\[
\sum_{t\in[T]:Y_{t}>0}\frac{1}{Y_{t}}\leq\sum_{n=0}^{n_{0}}2^{n}+\sum_{n=n_{0}+1}^{N_{T}}2\rho^{-1}\leq 2^{n_{0}+1}+2\rho^{-1}N_{T}\leq 8\rho^{-2}\log(1/\delta)+4\rho^{-1}\log T. \tag{F.4}
\]

Combining (F.2) and (F.4) and taking a union bound over $\mathcal{E}_{0},\dots,\mathcal{E}_{N_{T}}$, we obtain

\[
\begin{aligned}
\sum_{t=1}^{T}\min\left\{\frac{1}{Y_{t}},1\right\}&\leq 8\rho^{-2}\log(2N_{T}/\delta)+4\rho^{-1}\log(2TN_{T}/\delta)\\
&\leq 8\rho^{-2}\log(4\log T/\delta)+4\rho^{-1}\log(4T\log T/\delta)\leq\mathcal{O}(\rho^{-1}\log(T\log T/\delta)),
\end{aligned}
\]

where we use the fact that $\log_{2}T\leq 2\log T$. This finishes the proof of Lemma F.5. ∎
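As a Monte Carlo sanity check of Lemma F.5 (illustrative only; the horizon $T$, rate $\rho$, number of runs, and the slack constant below are hypothetical choices, since the lemma's $\mathcal{O}(\cdot)$ hides an unspecified constant), the following sketch simulates $X_{t}\sim\mathrm{Bernoulli}(\rho)$ and confirms that $\sum_{t}\min\{1/Y_{t},1\}$ stays within a constant multiple of $\rho^{-1}\log(T\log T)$.

```python
import math
import random

def partial_inverse_sum(T, rho, rng):
    """Simulate X_t ~ Bernoulli(rho), Y_t = X_1 + ... + X_t, and return
    sum_t min(1/Y_t, 1), the quantity bounded in Lemma F.5.
    (When Y_t = 0 the summand min{1/Y_t, 1} equals 1.)"""
    y, total = 0, 0.0
    for _ in range(T):
        y += rng.random() < rho  # Bernoulli(rho) increment
        total += min(1.0 / y, 1.0) if y > 0 else 1.0
    return total

rng = random.Random(0)
T, rho = 10_000, 0.1

# A few independent runs; the lemma predicts each is at most a constant
# multiple of rho^{-1} * log(T * log T) with high probability.
runs = [partial_inverse_sum(T, rho, rng) for _ in range(20)]
bound = (1 / rho) * math.log(T * math.log(T))
assert max(runs) <= 10 * bound  # generous constant, illustrative check
```

Heuristically, $Y_{t}\approx\rho t$, so the sum behaves like $\rho^{-1}\sum_{t}1/t\approx\rho^{-1}\log T$, consistent with the lemma.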